Generalized discounted continuous-time Markov decision processes 
Abstract: This article investigates the generalized discounted criteria for possibly explosive continuous-time
Markov decision processes (CTMDPs) in Polish spaces with unbounded transition and reward rates, which
allow the discount factors to be state-dependent and zero-valued, and thus cover the otherwise underdeveloped
total undiscounted criteria for non-absorbing CTMDPs. Under very mild conditions, we develop
the transformation method, which reduces our CTMDP problems to their discrete-time analogues, and then
show the existence of a deterministic (resp., randomized) stationary optimal policy for the unconstrained (resp.,
constrained) CTMDPs, where the optimality is out of the class of history-dependent policies. Moreover, the
optimality equation for the unconstrained case is also established. In contrast to the case of standard
discounted CTMDPs with a constant discount factor, we show that the transformation method could be
inapplicable to the concerned CTMDPs with generalized discount factors when our conditions are violated.

Keywords: Continuous-time Markov decision process, total reward criteria, non-absorbing case, state-dependent 
discount factors, transformation method. 
In this article we consider a continuous-time Markov decision process (CTMDP) with generalized discounted
criteria, allowing the discount factors to be non-constant and zero-valued, which, in particular, covers the
currently underdeveloped total undiscounted reward criteria for non-absorbing CTMDPs.

Continuous-time Markov decision processes have rich applications to information transmission, queueing
systems, epidemiology and so on; see the examples presented in [TH [21] HH [31] [32]. A common performance
measure of a CTMDP is the (expected) total discounted reward, which has been intensively studied by various
authors; see [3 HH [HI HZl [IHl [23] [30] for the recent developments, where constrained and unconstrained problems
are both investigated, and different sufficient conditions for the existence of an optimal policy are given. Such
solvability results are the main and fundamental objectives in the theoretical studies of CTMDPs, and they are
also of our primary interest. On the one hand, all those works consider a positive constant discount factor.
On the other hand, the discount factor itself is often interpreted as the risk-free rate of return, which in practice
need not be constant. Thus, a natural and meaningful extension of the standard discounted CTMDP
should allow the discount factor to be non-constant.

To the best of our knowledge, the first effort of studying CTMDPs with state-dependent discount factors was
made quite recently in [36], where the dynamic programming approach is justified for unconstrained problems,
for which it is shown that there exists a deterministic stationary optimal policy. However, the optimality is
only out of the class of (randomized) stationary policies, which is the main restriction of [36], let alone the
fact that the constrained problem is not considered therein. In fact, it is explained in the introduction of [36]
that the difficulty of extending to the more general class of policies lies in that the appropriate versions of the
Dynkin formula and the Kolmogorov forward equation for the CTMDPs with state-dependent factors have not
yet been established, so that it is not clear how the methods of investigations in [10] HZl [27] can be followed.
Thus, an unsolved problem arises:

• the topic of CTMDPs with state-dependent discounting and history-dependent policies remains uncovered.
It is worthwhile to point out another interpretation of the (constant) discount factor. One
can regard a discounted CTMDP as one with an undiscounted reward (i.e., when the discount factor is zero)
but also with an absorbing state, to which the process jumps with the intensity given by the discount factor, see
[10]. In this connection, we mention that the total undiscounted reward criteria for absorbing CTMDPs (i.e.,
the process falls in the cemetery in the state space within finite expected time under each policy) have been
recently studied in [THl [H]. However, to the best of our knowledge, the case of non-absorbing CTMDPs with
total undiscounted reward criteria has never been considered in the current literature. As a matter of
fact, even in the discrete-time framework, constrained discrete-time Markov decision processes (DTMDPs) with
total undiscounted rewards are very difficult to study, as explained in [4]; see also the first successful treatment
of such DTMDPs in the recent paper [S]. Thus, another unsolved problem is as follows:

• for CTMDPs that are not necessarily absorbing, the total undiscounted reward criteria are still left
unstudied.¹

In the present paper, for CTMDPs in Polish state and action spaces, we address both of the aforementioned
two uncovered topics by allowing the policies to be history-dependent and the discount factors to be
state-dependent and zero-valued, and derive the optimality results for both constrained and unconstrained problems.

Our method of proof is based on reducing the CTMDP problem to an associated DTMDP problem, where
the term "reduction" is understood to be possible if the two problems have the same value and an optimal policy
for the DTMDP problem corresponds to an optimal policy for the CTMDP problem. For standard discounted
CTMDPs, this transformation method has been fully justified in [11], where the author transforms the CTMDP
into a semi-Markov decision process (SMDP) and then a DTMDP, and shows that they all have the same
occupancy measures² without imposing any conditions. In the present paper, we develop this idea as well. In
this connection, the present paper improves and extends [11], which is exclusively about standard discounted
CTMDPs. However, such an extension is far from straightforward, see Subsection 5.3. Indeed, unlike in the
standard discounted case, for the concerned CTMDPs under generalized discounted criteria, especially when
the discount factors can be zero, the occupancy measures for the CTMDP can differ from those for the DTMDP,
see Lemma 5.3(a), and, as a matter of fact, in general, there can be situations when the reduction completely
fails to work, and non-stationary policies strictly outperform stationary policies, as will be seen in Examples 6.3
and 6.4. To overcome this difficulty, we suggest new and mild conditions, under which we introduce a state-space
decomposition argument into our analysis to show that the transformation method is still applicable.

In greater detail, the main contributions of this paper are as follows (given that some standard compactness 
and continuity conditions are satisfied). 

1. Firstly, under the Lyapunov-type condition (i.e., Condition 3.2 below), we show (in Theorems 3.2(b)
and 3.3(b) below) the existence of a deterministic (resp., randomized) stationary optimal policy out of
the class of history-dependent policies for the unconstrained (resp., constrained) CTMDP problem with
positive state-dependent discount factors. Note that the Lyapunov conditions allow the reward rates to
be unbounded from both above and below.

2. Secondly, without any Lyapunov-type condition, when the reward rates are arbitrarily nonpositive
(equivalently, the cost rates are nonnegative), we show (in Theorems 3.2(a), 3.3(a), and 3.4 below) the existence
of a deterministic (resp., randomized) stationary optimal policy (out of the class of history-dependent
policies) for the unconstrained (resp., constrained) CTMDP problem with state-dependent discount factors
(allowed to be zero-valued) and arbitrarily unbounded transition rates.

3. Thirdly, we show that, if one of the following two new conditions, both of which allow the transition rates
to take the value zero, is satisfied:

(a) the action space is compact, the reward rates are arbitrarily nonpositive, and a certain set of states
is empty;



¹By the way, under the conditions imposed in [36], the discount factors, which read as a function in the current state of the
process, are strictly positive and separated from zero (i.e., with a strictly positive lower bound), and therefore, the total undiscounted
reward criteria are not considered and cannot be covered by [36].

²See also [30], where the authors justify the transformation method using a different approach based on the Bellman (also called
dynamic programming or optimality) equations.



(b) for each state, either the transition rates are uniformly (with respect to the action) separated from
zero with a positive lower bound independent of the state, or the discount factor at that state is
positive (see Condition 3.1),

then the CTMDPs (with state-dependent and possibly zero-valued discount factors) can be reduced to
DTMDPs, even though their occupancy measures may not be identically the same, and thus extend the
transformation method suggested in [11], [30], which are exclusively about standard discounted CTMDPs.
Moreover, and this is interesting and insightful, we prove the general inapplicability of the transformation
method for the CTMDPs considered in this article when our conditions (see the aforementioned two) are
violated, see Examples 6.3 and [



A more detailed discussion on the relations between this paper and the literature can be seen in Remark 3.1
below.

Another interesting remark, among those scattered over the main text below, is as follows. Throughout this
paper, the optimality is out of the class of history-dependent policies, and the imposed conditions always allow
unbounded transition and reward rates; such CTMDPs were called challenging in the survey [15]. However,
examples presented in Section 6 indicate that for CTMDPs with possibly zero-valued (state-dependent) discount
factors, those with history-dependent policies and small transition rates not separated from zero could be more
challenging. This observation suggests completely new sufficient conditions for tackling such CTMDPs, see
Condition 3.1 and the conditions in Theorem 3.4.

The rest of this article is organized as follows. We describe the construction of the CTMDP under
history-dependent policies, and formulate the concerned (state-dependent) discounted optimization problems in Section
2. The main optimality results are stated in Section 3. To facilitate the proofs and improve the readability,
under the conditions that are relaxed later, we establish some preliminary results for the reduction of CTMDPs
to SMDPs and DTMDPs in Section 4. The proofs of the main statements are collected in Section 5. In Section
6, we give examples to illustrate the applications of the obtained optimality results and the technical roles played
by our conditions. Incidentally, Example 6.3 shows the insufficiency of the class of embedded Markov policies,
which contains the class of stationary policies, for CTMDPs with total undiscounted nonpositive reward criteria,
see Remark 6.1. This paper ends with a conclusion in Section 7.

2 Kitaev's construction and optimization problem statement 

Notations and conventions. In what follows, $I$ stands for the indicator function, $\delta_x(\cdot)$ is the Dirac measure
concentrated at $x$, and $\mathcal{B}(X)$ is the Borel $\sigma$-algebra of the topological space $X$. The abbreviation s.t. (resp.,
a.s.) stands for "subject to" (resp., "almost surely"). Below, unless stated otherwise, the term of measurability
is always understood in the Borel sense. Throughout this article, we adopt the conventions of $\frac{0}{0} := 0$, $0 \cdot \infty := 0$
and $\frac{1}{0} := +\infty$.

The primitives of a CTMDP are the following elements $\{S, A, (A(x) \subseteq A, x \in S), q(\cdot|x,a)\}$, where $S$ is
a Polish state space, $A$ is a Borel action space, and the multifunction $x \mapsto A(x) \in \mathcal{B}(A)$ ($x \in S$)
specifies the admissible action spaces, for which we assume that $A(x) \in \mathcal{B}(A)$ for each $x \in S$, and that its graph
$K := \{(x,a) : x \in S, a \in A(x)\}$ belongs to $\mathcal{B}(S \times A)$ and contains the graph of at least one measurable mapping
from $S$ to $A$. The transition rates are given by $q(\cdot|x,a)$, a signed kernel on $\mathcal{B}(S)$ given $(x,a) \in K$ such that
$q(\Gamma_S \setminus \{x\}|x,a) \ge 0$ for all $\Gamma_S \in \mathcal{B}(S)$. Throughout this article we assume that $q(\cdot|x,a)$ is conservative and
stable, i.e., $q(S|x,a) = 0$ and $\bar{q}_x := \sup_{a \in A(x)} q_x(a) < \infty$, where $q_x(a) := -q(\{x\}|x,a)$.
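
For a finite state and action space, the two standing assumptions on the transition rates (conservativeness and stability) can be checked mechanically. The following is only a minimal sketch under assumptions that are not part of the paper: the rates are stored in a hypothetical nested dictionary `q[x][a]` mapping target states to rates (diagonal entry included), and the names `exit_rate` and `is_conservative_and_stable` are illustrative.

```python
import math

def exit_rate(q, x, a):
    """q_x(a) := -q({x}|x,a); q[x][a] maps target state -> rate (hypothetical layout)."""
    return -q[x][a].get(x, 0.0)

def is_conservative_and_stable(q, tol=1e-12):
    """Check q(G \\ {x}|x,a) >= 0, q(S|x,a) = 0 and finiteness of q_x(a).

    With finitely many actions, sup_a q_x(a) is finite as soon as each q_x(a) is.
    """
    for x, actions in q.items():
        for a, row in actions.items():
            # Off-diagonal rates must be nonnegative.
            if any(rate < 0 for y, rate in row.items() if y != x):
                return False
            # Conservative: each row sums to zero.
            if abs(sum(row.values())) > tol:
                return False
            # Stable: the exit rate is finite.
            if not math.isfinite(exit_rate(q, x, a)):
                return False
    return True

# Illustrative three-state model (values chosen arbitrarily).
q = {
    0: {"slow": {0: -1.0, 1: 1.0}, "fast": {0: -3.0, 1: 2.0, 2: 1.0}},
    1: {"stay": {1: 0.0}},
    2: {"back": {2: -0.5, 0: 0.5}},
}
```

In a general Polish state space these properties are assumptions on the kernel rather than something one can enumerate; the sketch only illustrates what they say in the finite case.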

Following the Kitaev construction of a CTMDP [23], we take the sample space $\Omega := S \times ((0,\infty] \times S_\infty)^\infty$,
where $S_\infty := S \cup \{x_\infty\}$ with the isolated point $x_\infty \notin S$ accounting for the cases of finitely many jumps over the
infinite time horizon and infinitely many jumps over a finite time interval. We equip $\Omega$ with its Borel $\sigma$-algebra
$\mathcal{F}$. For each $n \ge 1$, and any element $\omega := (x_0, \theta_1, x_1, \theta_2, \dots) \in \Omega$, let

$$t_n(\omega) := t_{n-1}(\omega) + \theta_n, \quad \text{with } t_0(\omega) := 0,$$

and

$$t_\infty(\omega) := \lim_{n \to \infty} t_n(\omega).$$

Obviously, $t_n(\omega)$ are measurable mappings on the sample space $\Omega$. In what follows, we will omit the argument
$\omega \in \Omega$ from the presentation for simplicity, and understand $t_n$, $x_n$, $\theta_{n+1}$, and $t_\infty$ as the $n$-th jump moment,
jumped-in state, holding time of $x_n$, and the explosion moment. Also, we will regard $x_n$ and $\theta_{n+1}$ as the
coordinate variables, and note that the pairs $\{t_n, x_n\}$ form a marked point process with the internal history
$\{\mathcal{F}_t\}_{t \ge 0}$ (see Chapter 4 of [24]). The marked point process $\{t_n, x_n\}$ defines a stochastic process
$\{\xi_t, t \ge 0\}$ of interest on $(\Omega, \mathcal{F})$ by

$$\xi_t = \sum_{n \ge 0} I\{t_n \le t < t_{n+1}\} x_n + I\{t_\infty \le t\} x_\infty, \qquad (1)$$

where $x_\infty$ is a cemetery point so that $A(x_\infty) := \{a_\infty\}$ and $q_{x_\infty}(a_\infty) := 0$ with $a_\infty \notin A$ being some isolated
point. Below we denote $A_\infty := A \cup \{a_\infty\}$.

Definition 2.1 A (randomized history-dependent) policy $\pi$ for the CTMDP is given by a sequence $(\pi_n)$ such
that, for each $n = 0, 1, \dots$, $\pi_n(da|x_0, \theta_1, \dots, x_n, s)$ is a stochastic kernel on $A$ concentrated on $A(x_n)$, and for
each $\omega = (x_0, \theta_1, x_1, \theta_2, \dots) \in \Omega$, $t \ge 0$,

$$\pi(da|\omega,t) = I\{t = 0\}\pi_0(da|x_0, 0) + I\{t \ge t_\infty\}\delta_{a_\infty}(da) + \sum_{n \ge 0} I\{t_n < t \le t_{n+1}\}\pi_n(da|x_0, \theta_1, \dots, x_n, t - t_n). \qquad (2)$$

In other words, a policy $\pi$ is a predictable (with respect to $\{\mathcal{F}_t\}_{t \ge 0}$) stochastic kernel from $\Omega \times [0,\infty)$ to $A_\infty$,
see Theorem 4.19 in [24]. The class of all policies for the CTMDP is denoted by $\Pi_{CTMDP}$.

Under a policy $\pi := (\pi_n) \in \Pi_{CTMDP}$, we define the following random measure on $S \times [0,\infty)$

$$\nu^\pi(dt,dy) := \int_A q(dy \setminus \{\xi_{t-}\}|\xi_{t-}, a)\pi(da|\omega,t)dt
= \sum_{n \ge 0} \int_A q(dy \setminus \{x_n\}|x_n, a)\pi_n(da|x_0, \theta_1, \dots, x_n, t - t_n) I\{t_n < t \le t_{n+1}\}dt \qquad (3)$$

with $q(dy|x_\infty, a_\infty) := 0$. Suppose that an initial distribution $\gamma$ on $S$ is given. Then by Theorem 4.27 in [24],
there exists a unique probability measure $P_\gamma^\pi$ on $(\Omega, \mathcal{F})$ such that

$$P_\gamma^\pi(\xi_0 \in dx) = \gamma(dx),$$

and with respect to $P_\gamma^\pi$, $\nu^\pi$ is the dual predictable projection of the random measure

$$\mu(dt,dy) := \sum_{n \ge 1} I\{t_n < \infty\} I\{x_n \in dy\} I\{t_n \in dt\}.$$

This gives rise to the desired stochastic basis $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, P_\gamma^\pi)$, always assumed to be complete. The process
$\{\xi_t\}$ defined by (1) under the probability measure $P_\gamma^\pi$ is called a CTMDP.

Below, when $\gamma(\cdot)$ is a Dirac measure concentrated at $x \in S$, we use the "degenerated" notation $P_x^\pi$.
Expectations with respect to $P_\gamma^\pi$ and $P_x^\pi$ are denoted as $E_\gamma^\pi$ and $E_x^\pi$, respectively.

We point out that since the random measure $\mu$ completely characterizes the marked point process $\{t_n, x_n\}$
and thus $\xi_t$ [22], $\nu^\pi(dt,dy)$ uniquely defines the processes $\{t_n, x_n\}$ and $\xi_t$.

Remark 2.1 Under the probability measure $P_\gamma^\pi$, the system dynamics of a CTMDP can be described as follows.
The initial state $x_0$ has the distribution given by $\gamma$. Given the current state $x_n$, the sojourn time $\theta_{n+1}$ has the
tail function given by

$$P_\gamma^\pi(\theta_{n+1} > t|x_0, \theta_1, \dots, x_n) = e^{-\int_0^t \int_A q_{x_n}(a)\pi_n(da|x_0, \theta_1, \dots, x_n, s)ds},$$

and upon a jump, the distribution of the next state $x_{n+1}$ is given by

$$P_\gamma^\pi(x_{n+1} \in \Gamma|x_0, \theta_1, \dots, x_n, \theta_{n+1}) = \frac{\int_A q(\Gamma \setminus \{x_n\}|x_n, a)\pi_n(da|x_0, \theta_1, \dots, x_n, \theta_{n+1})}{\int_A q_{x_n}(a)\pi_n(da|x_0, \theta_1, \dots, x_n, \theta_{n+1})}$$

for each $\Gamma \in \mathcal{B}(S)$, where and below we quite formally put $\int_A q(\Gamma \setminus \{x_n\}|x_n, a)\pi_n(da|x_0, \theta_1, \dots, x_n, \infty) := 0$ for each
$\Gamma \in \mathcal{B}(S)$ and $\pi_n(\{a_\infty\}|x_0, \theta_1, \dots, x_n, \infty) := 1$, and use the convention of $\frac{0}{0} := 0$, so that

$$P_\gamma^\pi(x_{n+1} = x_\infty|x_0, \theta_1, \dots, x_n, \theta_{n+1}) = 1 - P_\gamma^\pi(x_{n+1} \in S|x_0, \theta_1, \dots, x_n, \theta_{n+1}).$$
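
The dynamics in Remark 2.1 simplify considerably in a special case that can be sketched in code: a finite model under a randomized stationary policy, where the sojourn time is exponential with rate $\int_A q_x(a)\pi(da|x)$. The sketch below is an illustration only. The data layout (`q[x][a]` holding off-diagonal rates only, `pi[x]` a dictionary of action probabilities) and all function names are assumptions made here, not notation from the paper; `None` stands in for the cemetery $x_\infty$.

```python
import math
import random

def mean_exit_rate(x, pi, q):
    """Integral of q_x(a) against pi(da|x); q[x][a] maps y != x to the rate q({y}|x,a)."""
    return sum(p * sum(q[x][a].values()) for a, p in pi[x].items())

def jump_distribution(x, pi, q):
    """Distribution of the next state given a jump out of x (ratio formula of Remark 2.1)."""
    total = mean_exit_rate(x, pi, q)
    if total == 0.0:
        return {}  # no jump ever occurs from x under this policy
    dist = {}
    for a, p in pi[x].items():
        for y, rate in q[x][a].items():
            dist[y] = dist.get(y, 0.0) + p * rate / total
    return dist

def sample_step(x, pi, q, rng):
    """Sample (sojourn time, next state); (inf, None) encodes theta = infinity, x_infinity."""
    total = mean_exit_rate(x, pi, q)
    if total == 0.0:
        return math.inf, None
    theta = rng.expovariate(total)  # tail exp(-total * t) for a stationary policy
    dist = jump_distribution(x, pi, q)
    states = list(dist)
    y = rng.choices(states, weights=[dist[s] for s in states])[0]
    return theta, y

# Illustrative model: state 0 has two actions, states 1 and 2 are never left.
q = {0: {"a": {1: 1.0}, "b": {1: 1.0, 2: 3.0}}, 1: {"stay": {}}, 2: {"stay": {}}}
pi = {0: {"a": 0.5, "b": 0.5}, 1: {"stay": 1.0}, 2: {"stay": 1.0}}
```

For a genuinely history-dependent policy the rate depends on the elapsed time $s$ since the last jump, so the sojourn time is no longer exponential; that case is not covered by this sketch.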



Let $\alpha(x) \ge 0$ be a measurable function on $S$, representing the discount factor if the current state is $x$; and
$r_i(x,a)$, $i = 0, 1, \dots, N$ with $N$ being a fixed nonnegative integer, be measurable functions on $K$, representing the
reward rates. Quite formally, for any measurable function $f$ on $K$, we put $f(x_\infty, a_\infty) := 0$. This agreement,
together with (1) and that $q(\{x_\infty\}|x_\infty, a_\infty) = 0 = q_{x_\infty}(a_\infty)$, allows one to write

$$W(\gamma,\pi,f^{\pm}) := E_\gamma^\pi\left[\int_0^\infty \int_A e^{-\int_0^t \alpha(\xi_s)ds} f^{\pm}(\xi_t, a)\pi(da|\omega,t)dt\right]
= E_\gamma^\pi\left[\sum_{n=0}^\infty \int_{t_n}^{t_{n+1}} \int_A e^{-\int_0^t \alpha(\xi_s)ds} f^{\pm}(x_n, a)\pi_n(da|x_0, \theta_1, \dots, x_n, t - t_n)dt\right], \qquad (4)$$

where $f^{+}(x,a) := \max\{0, f(x,a)\}$ and $f^{-}(x,a) := \max\{0, -f(x,a)\}$. Thus, the expected total discounted
reward (with state-dependent discounting) reads

$$W(\gamma,\pi,r_i) := W(\gamma,\pi,r_i^{+}) - W(\gamma,\pi,r_i^{-}), \quad \text{for } i = 0, 1, \dots, N,$$

where and below, as in [7], the convention of $\infty - \infty := -\infty$ is always in use. In fact, throughout the present
paper, this convention is only relevant to Theorem 3.1(a).
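
To make the role of the state-dependent, possibly zero-valued discount factor concrete, the integral inside the expectation in (4) can be evaluated along a single sample path. The sketch below assumes, beyond what the paper states, that one fixed action is held during each sojourn (so the reward rate is constant on each interval) and that the path is given as a list of hypothetical `(state, holding_time)` pairs; the function name is illustrative.

```python
import math

def discounted_reward(path, alpha, reward):
    """Pathwise value of int_0^inf exp(-int_0^t alpha(xi_s) ds) r(xi_t) dt.

    path   : list of (state, holding_time); the last holding time may be math.inf
    alpha  : dict state -> discount factor alpha(x) >= 0 (zero allowed)
    reward : dict state -> reward rate under the action held in that state
    """
    total = 0.0
    log_discount = 0.0  # equals -int_0^{t_n} alpha(xi_s) ds at the start of each sojourn
    for x, theta in path:
        a, r = alpha[x], reward[x]
        if a > 0.0:
            # int_0^theta e^{-a s} ds = (1 - e^{-a theta}) / a, also valid for theta = inf
            weight = (1.0 - math.exp(-a * theta)) / a
        else:
            # zero discount factor: the sojourn is weighted by its plain length
            weight = theta
        if r != 0.0:  # skipping r = 0 implements the convention 0 * inf := 0
            total += math.exp(log_discount) * r * weight
        if a > 0.0:
            log_discount -= a * theta
    return total
```

For instance, with `alpha = {0: 0.0, 1: 1.0}` and `reward = {0: 2.0, 1: 1.0}`, the path `[(0, 1.5), (1, math.inf)]` earns `2.0 * 1.5` undiscounted in state 0 and then `1.0 / 1.0` in state 1. The quantity $W(\gamma,\pi,r_i)$ is the expectation of such pathwise values (taken separately for the positive and negative parts).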
We are interested in the following optimization problem

$$W(\gamma,\pi,r_0) \to \max_{\pi \in \Pi_{CTMDP}} \qquad (5)$$
$$\text{s.t. } W(\gamma,\pi,r_j) \ge d_j, \quad j = 1, 2, \dots, N,$$

where $d_j \in (-\infty,\infty)$ for each $j = 1, 2, \dots, N$, and $\gamma$ is the fixed initial distribution on $S$. We denote by
$\Pi^{c}_{CTMDP} \subseteq \Pi_{CTMDP}$ the class of policies satisfying the constraints in problem (5). To avoid trivial situations,
we assume $\Pi^{c}_{CTMDP} \ne \emptyset$ throughout this article. Denote by

$$W(\gamma) := \sup_{\pi \in \Pi^{c}_{CTMDP}} W(\gamma,\pi,r_0).$$

Then a policy $\pi^* \in \Pi^{c}_{CTMDP}$ is called constrained-optimal for problem (5) if $W(\gamma,\pi^*,r_0) = W(\gamma)$.

When $N = 0$, we put $\Pi^{c}_{CTMDP} = \Pi_{CTMDP}$, and problem (5) is reduced to an unconstrained problem,
which we also consider; in this situation, we do not fix the initial distribution. More precisely, the problem
reads

$$W(\delta_x,\pi,r_0) \to \max_{\pi \in \Pi_{CTMDP}}, \qquad (6)$$

for which a policy $\pi^* \in \Pi_{CTMDP}$ is called optimal if $W(\delta_x,\pi^*,r_0) = W(\delta_x)$ for every $x \in S$.

In what follows, we write $W(\delta_x,\pi,r_i)$ as $W(x,\pi,r_i)$ for convenience.

We are also interested in policies in specific forms. A policy $\pi = (\pi_n)_{n=0,1,\dots} \in \Pi_{CTMDP}$ is called
(randomized) embedded Markov (resp., stationary) if each of the stochastic kernels $\pi_n$ reads, with slight but conventional
abuse of notation, $\pi_n(da|x_0, \theta_1, \dots, x_n, t - t_n) = \pi_n(da|x_n)$ (resp., $\pi_n(da|x_0, \theta_1, \dots, x_n, t - t_n) = \pi(da|x_n)$). A
stationary policy is further called deterministic if $\pi_n(da|x_0, \theta_1, \dots, x_n, t - t_n) = \delta_{\varphi(x_n)}(da)$ for some measurable
mapping $\varphi$ from $S$ to $A$ such that $\varphi(x) \in A(x)$ for each $x \in S$; the existence of such a mapping is guaranteed
by the assumption imposed on the multifunction $A(\cdot)$, which also implies that the set $\Pi_{CTMDP}$ is nonempty.

Remark 2.2 Note that in the literature, see, for example, [14], [15], a policy $\pi = (\pi_n)_{n=0,1,\dots}$ for a CTMDP
with $\pi_n(da|x_0, \theta_1, \dots, x_n, t - t_n) = \pi(da|x_n, t)$ is called a (randomized) Markov policy, which is different from
the embedded Markov policy defined above³. We also point out that an embedded Markov policy for a CTMDP
is exactly a Markov one for a DTMDP. While it is known in J^ that the class of Markov policies is sufficient
for DTMDPs with total undiscounted reward criteria (or even the more general so-called decomposable criteria),
Example 6.3 and Remark 6.1 below show that the class of embedded Markov policies is insufficient for CTMDPs
with total undiscounted reward criteria.

Finally, as a general remark about the terminology, randomized policies for CTMDPs could well be called
relaxed policies, since they can be understood as deterministic policies when the action space is relaxed to
be the collection of all the probability measures on it, as explained in [11] and p. 139 in the monograph [24].
However, in the present paper we have decided to follow the convention of calling such policies "randomized"
as in m m [TCI HZl Ha HZl Ea EB for consistency.

³Indeed, that is why we use the term "embedded".



3 Optimality results 

In this section, we state the conditions, the associated optimality results, as well as some interesting remarks.

Condition 3.1. There is some constant $\underline{q} > 0$ such that for each $x \in S$, either $\alpha(x) > 0$, or $\inf_{a \in A(x)} q_x(a) \ge \underline{q}$.

Condition 3.1 is new and of interest, as discussed in detail in Remark 3.1. In fact, it validates the
calculations in ([TS]), and in general cannot be relaxed to $\alpha(x) + q_x(a) > 0$ for each $x \in S$ and $a \in A(x)$,
see Example 6.2 below. Also, Condition 3.1 plays an important role in the proofs of Theorems 3.1 and 3.3,
as explained by means of examples in Section 6 below. On the other hand, at the cost of some additional
conditions on the action space, we can still omit Condition 3.1 for the optimality results, see Theorem 3.4
below; this modification is not straightforward.
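
On a finite model, Condition 3.1 reduces to a simple check: at every state where the discount factor vanishes, the smallest exit rate over the admissible actions must be strictly positive, and the constant can then be taken as the minimum of these values. The sketch below illustrates this under assumptions introduced here (hypothetical dictionaries `alpha[x]` and `exit_rates[x][a]` holding $\alpha(x) \ge 0$ and $q_x(a)$; the function name is illustrative).

```python
def satisfies_condition_3_1(alpha, exit_rates):
    """Return (holds, lower_bound) for a finite state and action space.

    alpha[x]         : discount factor at state x, assumed nonnegative
    exit_rates[x][a] : q_x(a) for each admissible action a at x
    lower_bound is the largest admissible constant, or None if every
    state has a positive discount factor (then any constant works).
    """
    undiscounted = [x for x in alpha if alpha[x] == 0.0]
    lower = min((min(exit_rates[x].values()) for x in undiscounted), default=None)
    return (lower is None or lower > 0.0), lower
```

With infinitely many states the uniformity of the lower bound in the state is an essential part of the condition, and pointwise positivity is not enough; the finite check above cannot capture that distinction.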

Condition 3.2.

(a) There exist a measurable function $w(x) \ge 1$ on $S$, and constants $c \in (-\infty,\infty)$, $b \ge 0$, $M > 0$ such that
$\int_S w(y)q(dy|x,a) \le cw(x) + b$ for all $a \in A(x)$ and $x \in S$, and $\bar{q}_x \le Mw(x)$ for all $x \in S$.

(b) $A(x)$ is a compact set for each $x \in S$.

(c) For each $\Gamma_S \in \mathcal{B}(S)$, $q(\Gamma_S|x,a)$ is continuous in $a \in A(x)$, $q_x(a)$ is continuous in $a \in A(x)$, and $r_0(x,a)$ is
upper semicontinuous in $a \in A(x)$ for each $x \in S$.

(d) $\int_S w(y)q(dy|x,a)$ is upper semicontinuous in $a \in A(x)$ for each $x \in S$.

(e) There exist a measurable function $w_1(x) \ge 1$ on $S$ and constants $c_1 \in (-\infty,\infty)$, $b_1 \ge 0$, $M_1 > 0$ such that
$(\bar{q}_x + 1)w_1(x) \le M_1 w(x)$, and $\int_S w_1(y)q(dy|x,a) \le c_1 w_1(x) + b_1$ for each $x \in S$ and $a \in A(x)$.

(f) There exist constants $M_2 > 0$ and $\beta > 0$ such that $|r_i(x,a)| \le M_2 w_1(x)$ for all $a \in A(x)$, $x \in S$, $i = 0, \dots, N$,
and $\max\{c, c_1\} < \beta \le \alpha(x)$ for all $x \in S$.



Condition 3.2 is of Lyapunov type, and validates the relevant results in, for example, [27] to be used in the
proofs of some but not all of the main statements below. Its part (a) ensures the non-explosion of the process
$\xi_t$ under each policy, i.e., $P_x^\pi(t_\infty = \infty) = 1$ for each $\pi \in \Pi_{CTMDP}$ and $x \in S$. Parts (b-d) of Condition 3.2 are
standard (strong) continuity-compactness conditions needed for the existence of optimal policies, and they are
similar to those imposed in [13] HH [17] [27] for standard discounted CTMDPs. Condition 3.2(e,f) is of technical
nature. We finally remark that Condition 3.2 can be satisfied by CTMDPs with unbounded transition rates
and unbounded (from both above and below) reward rates.

In order to present the next theorem and improve the readability, we recall some definitions. They are of
technical nature, and can be skipped without affecting the understanding of the reasoning and most of the results
in this paper. A function $f$ defined on the Borel space $S$ is called upper semianalytic if for each $c \in (-\infty,\infty)$,
$\{x \in S : -f(x) < c\}$ is an analytic set, which is, by definition, the nucleus of some Suslin scheme on $\mathcal{B}(S)$. A
Borel-measurable subset of $S$ is always an analytic subset. A function $f$ on the Borel space $S$ is called universally
measurable if it is measurable with respect to the universal $\sigma$-algebra $\mathcal{U}_S$ defined by $\mathcal{U}_S := \bigcap_{p \in \mathcal{P}(S)} \mathcal{B}_p(S)$, where
$\mathcal{P}(S)$ represents the collection of (Borel) probability measures on $S$, and $\mathcal{B}_p(S)$ is the completion of $\mathcal{B}(S)$ with
respect to $p \in \mathcal{P}(S)$, see more details in Chapter 7 of [3].

The next theorem establishes the Bellman (optimality, or dynamic programming) equation and a value
iteration algorithm for the value function of the CTMDP problem (5) with generalized discounting.

Theorem 3.1 Consider problem (6), and recall its value function $W(x) = \sup_{\pi \in \Pi_{CTMDP}} W(x,\pi,r_0)$, which is
$[-\infty,\infty]$-valued. Suppose Condition 3.1 is satisfied. Then the following assertions hold for the function $W$.

(a) The function $W$ is universally measurable in $x \in S$.

(b) If for each $x \in S$ and policy $\pi \in \Pi_{CTMDP}$, either $W(x,\pi,r_0^{+}) < \infty$ or $W(x,\pi,r_0^{-}) < \infty$, then $W$ satisfies
the Bellman (also called dynamic programming, or optimality) equation

$$W(x) = \sup_{a \in A(x)} \left\{ \frac{r_0(x,a)}{\alpha(x) + q_x(a)} + \frac{1}{\alpha(x) + q_x(a)} \int_{S \setminus \{x\}} W(y)q(dy|x,a) \right\}, \quad x \in S. \qquad (7)$$

(c) If $r_0(x,a) \ge 0$ for all $x \in S$ and $a \in A(x)$, then $W$ is the minimal nonnegative upper semianalytic
solution to (7), and can be obtained through the value iteration algorithm $W(x) = \lim_{k \to \infty} W_k(x)$, where
$W_0(x) := 0$, and

$$W_{k+1}(x) := \sup_{a \in A(x)} \left\{ \frac{r_0(x,a)}{\alpha(x) + q_x(a)} + \frac{1}{\alpha(x) + q_x(a)} \int_{S \setminus \{x\}} W_k(y)q(dy|x,a) \right\} \quad \text{for } k \ge 0.$$

(d) If for each $x \in S$ and $a \in A(x)$, $r_0(x,a) \le 0$,⁴ then $W$ is the maximal nonpositive upper semianalytic
solution to (7).

(e) If Condition 3.2 is satisfied, then $W(x)$ is the unique measurable function satisfying (7), where the
uniqueness is out of the class of measurable functions $u$ on $S$ satisfying $\sup_{x \in S} \frac{|u(x)|}{w(x)} < \infty$, and $W(x)$ can be
obtained via the value iteration algorithm $W(x) = \lim_{k \to \infty} W_k(x)$ with

$$W_0(x) := \frac{M_2 b_1}{\beta(\beta - c_1)} + \frac{M_2}{\beta - c_1} w_1(x),$$

$$W_{k+1}(x) := \sup_{a \in A(x)} \left\{ \frac{r_0(x,a)}{\alpha(x) + q_x(a)} + \frac{1}{\alpha(x) + q_x(a)} \int_{S \setminus \{x\}} W_k(y)q(dy|x,a) \right\}, \quad k \ge 0.$$

The proofs of this and the other main statements in this paper are postponed to Section 5.
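
For intuition, the value iteration of Theorem 3.1(c) can be carried out directly on a finite model with nonnegative reward rates. The following is a sketch under assumptions made only for this illustration: `model[x][a]` is a hypothetical pair `(r_0(x,a), rates)` where `rates` maps each $y \ne x$ to $q(\{y\}|x,a)$, and the code raises an error when $\alpha(x) + q_x(a) = 0$ rather than implementing the conventions $\frac{0}{0} := 0$ and $\frac{1}{0} := +\infty$.

```python
def bellman_update(W, model, alpha):
    """One application of the right-hand side of equation (7) on a finite model."""
    new_W = {}
    for x, actions in model.items():
        best = float("-inf")
        for a, (r, rates) in actions.items():
            denom = alpha[x] + sum(rates.values())  # alpha(x) + q_x(a)
            if denom == 0.0:
                raise ValueError("alpha(x) + q_x(a) = 0 is not handled in this sketch")
            value = (r + sum(rate * W[y] for y, rate in rates.items())) / denom
            best = max(best, value)
        new_W[x] = best
    return new_W

def value_iteration(model, alpha, iterations=1000, tol=1e-12):
    """Iterate W_{k+1} = T W_k from W_0 = 0 until successive iterates agree up to tol."""
    W = {x: 0.0 for x in model}
    for _ in range(iterations):
        new_W = bellman_update(W, model, alpha)
        if max(abs(new_W[x] - W[x]) for x in model) < tol:
            return new_W
        W = new_W
    return W

# Illustrative model: the discount factor is zero at state 0, where all exit rates are >= 1.
model = {
    0: {"a0": (1.0, {1: 1.0}), "a1": (3.0, {2: 2.0})},
    1: {"go": (2.0, {2: 1.0})},
    2: {"stay": (0.0, {})},
}
alpha = {0: 0.0, 1: 1.0, 2: 1.0}
W = value_iteration(model, alpha)
```

In this toy model the iterates stabilize after two updates at `W[0] = 2.0`, `W[1] = 1.0`, `W[2] = 0.0`. The stopping rule is a numerical convenience; the theorem itself only asserts convergence of $W_k$ to $W$.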

The next statement provides several sufficient conditions for the existence of a deterministic stationary
optimal policy for problem (6) out of the class of history-dependent policies.

Theorem 3.2 Consider problem (6). Suppose Condition 3.1 is satisfied. Then the following assertions hold.

(a) If Condition 3.2(b,c) is satisfied, $r_0(x,a) \le 0$ for each $x \in S$, $a \in A(x)$, and the multifunction $x \to A(x)$ is
separable⁵, then there exists a deterministic stationary optimal policy.

(b) If Condition 3.2 is satisfied, then there exists a deterministic stationary optimal policy.
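
On a finite model, a natural candidate for the deterministic stationary policy of Theorem 3.2 is obtained by picking, at each state, an action attaining the supremum in the Bellman equation (7). The sketch below only illustrates this selection step and is not the paper's proof technique. It assumes a hypothetical layout `model[x][a] = (r_0(x,a), rates)` with `rates` mapping $y \ne x$ to $q(\{y\}|x,a)$, a given solution `W` of (7), and $\alpha(x) + q_x(a) > 0$ everywhere.

```python
def greedy_policy(W, model, alpha):
    """Select at each state an action maximizing the bracket in equation (7)."""
    policy = {}
    for x, actions in model.items():
        def bracket(a, x=x, actions=actions):
            r, rates = actions[a]
            denom = alpha[x] + sum(rates.values())  # assumed strictly positive here
            return (r + sum(rate * W[y] for y, rate in rates.items())) / denom
        policy[x] = max(actions, key=bracket)
    return policy

# Illustrative data: W solves (7) for this toy model.
model = {
    0: {"a0": (1.0, {1: 1.0}), "a1": (3.0, {2: 2.0})},
    1: {"go": (2.0, {2: 1.0})},
    2: {"stay": (0.0, {})},
}
alpha = {0: 0.0, 1: 1.0, 2: 1.0}
W = {0: 2.0, 1: 1.0, 2: 0.0}
phi = greedy_policy(W, model, alpha)
```

Here `phi` plays the role of the measurable mapping $\varphi$ in the definition of a deterministic stationary policy. In general spaces the existence of a measurable maximizer is exactly what the compactness, continuity and separability conditions are there to guarantee.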

Now we turn to the constrained problem (5).

Condition 3.3. With the functions $w$ and $w_1$ as in Condition 3.2:

(a) $A(x) = A$ for each $x \in S$.

(b) For any bounded continuous function $f(x)$ on $S$, $\int_S f(y)q(dy|x,a)$ is continuous in $(x,a) \in K$.

(c) $q_x(a)$ is continuous in $(x,a) \in K$, and $\alpha(x)$ is continuous in $x \in S$.

(d) $r_i(x,a)$ ($i = 0, 1, \dots, N$) are upper semicontinuous in $(x,a) \in K$.

(e) sup^gg "^f X < 00, J„w{x)^{dx) < 00, the function wi is continuous, and there exists an increasing 

sequence of compact sets Kn t K such that lim„^oo ^'<^i{x.a)eK\K„ w (x) ~ '^^■ 

(f) $\frac{q_x(a)}{\alpha(x) + q_x(a)}$ is continuous in $(x,a) \in K$.

Condition 3.3(b-d) is a standard (weak) continuity-compactness condition and similar to those in [17] [57] for
constrained discounted CTMDPs. By the way, we point out that it follows from Lemma 3.10 of [37] that
Condition 3.3(c') implies Condition 3.2(b'). Part (a) of Condition 3.3 is further imposed to validate the results
obtained in [5] for DTMDPs, which are needed in the proofs of the forthcoming theorems, while part (f) is only
needed when Condition 3.1 does not hold.

Theorem 3.3 Consider the constrained problem (5). Suppose Condition 3.1 is satisfied. Then the following
assertions hold.



⁴As usual, maximizing a negative reward is equivalent to minimizing a positive cost.

⁵That means, $A$ contains a countable dense subset $A'$ such that $A' \cap A(x)$ is dense in $A(x)$ for each $x \in S$; and
$\{x \in S : a \in A(x)\} \in \mathcal{B}(S)$ for each $a \in A$, see [33].



(a) Suppose Conditions 3.2(b) and 3.3(a-d) are satisfied. If $r_i(x,a) \le 0$ for each $i = 0, 1, \dots, N$, and there
exists a feasible policy $\pi \in \Pi_{CTMDP}$ such that $W(\gamma,\pi,r_0) > -\infty$, then there is a randomized stationary
constrained-optimal policy.

(b) Suppose Conditions 3.2(a,e,f) and 3.3(b-e) are satisfied. If problem (5) is feasible, then there is a
randomized stationary constrained-optimal policy.

The results stated above are all based on Condition 3.1, which is essentially new as compared to those
in the literature. In order to appreciate its interest, we make the following remark.

Remark 3.1.

(a) If there exists some $\underline{q} > 0$ such that $q_x(a) \ge \underline{q}$ for each $(x,a) \in K$, then one can consider the case
of $\alpha(x) \equiv 0$ for each $x \in S$, and thus Condition 3.1 covers strictly non-absorbing CTMDPs with total
undiscounted reward criteria. We point out that for constrained DTMDPs, the total (undiscounted) reward
criteria are very difficult for investigations CT, and they are mainly studied under the absorbing condition
(i.e., the expected time for the DTMDP to be absorbed at some cemetery set is finite under each policy),
see in Uiil \21^ . In comparison, for constrained CTMDPs with total reward criteria, even under suitable
conditions guaranteeing the process to visit the absorbing set within finite expected time, upon which the
transition rates are all zero, the investigations were conducted only recently, see \1^ I JP) /. In this connection,
our Condition 3.1 and thus the optimality results derived above are essentially new and cover the
underdeveloped non-absorbing case. Besides, we will even show that the new Condition 3.1 can be further
withdrawn if the reward rates are nonpositive.

(b) In case $\alpha(x) > 0$ for each $x \in S$, Theorems 3.1, 3.2 and 3.3 are extensions of the results in Ull \17\
[^ r?^ . where the discount factors are positive constants, and [36], where only stationary policies are
considered. In fact, examples in Section 6 show that great technical difficulties could arise if one considers
history-dependent (thus non-stationary) policies. Note also that our conditions allow unbounded transition
rates, and in particular, when the reward rates are nonnegative, the transition rates could be arbitrarily
unbounded, and thus our results are applicable to explosive CTMDPs, too, whereas only non-explosive
CTMDPs are considered in the majority of the literature on this topic, see [13], [14], [16], [17], [27], [31].

As promised, the next optimality result does not require Condition 3.1. On the other hand, it is based on
the compactness of the action space A, the nonpositivity of the reward rates, as well as the emptiness of a 
certain set, i.e., 5*1 \ 5*1 = 0, where 

f / N \ ^ 



Si :~ i X £ S : a(x) + inf qx(a) ~ 0, inf qx(a) — } r-Ax, a) I > 



i=0 



B{x) -{aeA: qx{a) = 0}, (8) 



5i := < a; £ S'l : sup qx{a) = 
L aeA 

Their intuitive meanings are postponed to Remark 5.2 below.

Theorem 3.4 Suppose Conditions 3.2(b) and 3.3(a-d,f) are satisfied, and $r_i \le 0$ for each $i = 0, 1, \dots, N$. In
addition, Si\ Si ^ 0. Then, the following assertions hold. 

(a) Consider the unconstrained problem (6). If for each $x \in S$, there exists a policy $\pi \in \Pi_{CTMDP}$ such that
$W(x,\pi,r_0) > -\infty$, then the value function $W$ is the maximal nonpositive measurable solution to equation
(7), and there exists a deterministic stationary optimal policy.

(b) Consider the constrained problem (5). If there exists a feasible policy $\pi \in \Pi_{CTMDP}$ such that $W(\gamma,\pi,r_0) > -\infty$,
then there is a randomized stationary constrained-optimal policy.



The proofs of all the previous optimality results, as presented in Section 5, are based on reducing the CTMDP
to an equivalent SMDP and then an equivalent DTMDP defined in Corollary 5.1.

Regarding our contribution to this transformation method, we have the following remark. 

Remark 3.2 In the majority of the literature, the transformation method is studied based on either relatively
simple policies or bounded transition rates. To the best of our knowledge, for CTMDPs with positive constant
discount factors, unbounded transition rates and history-dependent policies, this method was only established
recently in [11], [30]. In comparison, under very mild conditions, the present paper further extends this
transformation method to CTMDPs with varying discount factors, which, in particular, could be zero-valued, and
this extension is far from straightforward since in this case, the occupancy measures of CTMDPs and those of
the SMDPs and DTMDPs can be different, see Lemma 5.3(a) below. Furthermore, we will show
by means of an example that if our conditions are violated (in particular, if the discount factors are zero, the
transition rates are not separated from zero, and the action space is non-compact), then this transformation
method could be inapplicable, see Example 6.3 and Remark 6.1 below. In this connection, one can say that the
result of Theorem 3.4 is new.



4 Auxiliary results on SMDPs, DTMDPs and CTMDPs 

In order to prove the main results, we need some facts about SMDPs, DTMDPs and CTMDPs, which are briefly presented in a separate section because they are self-contained and of independent interest. In particular, we show in Theorem 4.1 below that an SMDP is equivalent to a corresponding DTMDP in an appropriate sense without imposing any conditions. Since we allow the discount factors for the possibly explosive SMDP to be non-constant and to take the value zero, the results are extensions of those in [5], which exclusively concern
standard discounted SMDPs only. In particular, they together with the results in [S] for DTMDPs allow one 
to derive the optimality results for constrained SMDPs with total undiscounted reward criteria; since our main 
interest is in the more complicated case of CTMDPs, such details are omitted. 

4.1 Construction of SMDPs and DTMDPs 

In line with [8], see also Chapter 11 of [32], an SMDP is specified by the primitives $\{S, A, (A(x), x \in S), Q(dy, dt \mid x, a)\}$, where $S$ is a Borel state space, $A$ is a Borel action space, and the multifunction $A(x) \subseteq A$ ($x \in S$) specifies the admissible action spaces. It is assumed that the multifunction $A(\cdot)$ has a measurable graph $K := \{(x,a) : x \in S, a \in A(x)\}$ and is uniformizable, i.e., the set $K$ contains the graph of at least one measurable mapping from $S$ to $A$. $Q(dy, dt \mid x, a)$ is a sub-stochastic kernel on $S \times [0,\infty)$ given $(x,a) \in K$, representing the joint distribution of the sojourn time until the next jump and the post-jump state, given the current state $x$ and action $a \in A(x)$.

To define policies for an SMDP, we introduce the sample space $H_\infty := S \times (A_\infty \times [0,\infty] \times S_\infty)^\infty$ endowed with its Borel $\sigma$-algebra $\mathcal{B}(H_\infty)$, where $S_\infty := S \cup \{x_\infty\}$ and $A_\infty := A \cup \{a_\infty\}$, with $x_\infty \notin S$ and $a_\infty \notin A$ being two isolated points. Accordingly, we also extend $Q(\cdot,\cdot \mid \cdot,\cdot)$ such that $Q(\{x_\infty\} \times \{\infty\} \mid x_\infty, a_\infty) = 1$ and $Q(\{x_\infty\} \times \{\infty\} \mid x,a) = 1 - Q(S \times [0,\infty) \mid x,a)$ for each $(x,a) \in K$; in this way, $Q(\cdot,\cdot \mid \cdot,\cdot)$ becomes a stochastic kernel on $[0,\infty] \times S_\infty$ given $x \in S_\infty$ and $a \in A(x)$, with $A(x_\infty) = \{a_\infty\}$. Denote by $h_n := (\tilde{x}_0, \tilde{a}_0, \tilde{\theta}_1, \tilde{x}_1, \tilde{a}_1, \tilde{\theta}_2, \dots, \tilde{x}_n) \in S \times (A_\infty \times [0,\infty] \times S_\infty)^n =: H_n$ for each $n = 0,1,\dots$. Then, a policy $\pi$ for an SMDP is a sequence $(\pi_n)_{n=0,1,\dots}$ with $\pi_n(da \mid h_n)$ being a stochastic kernel on $A_\infty$ given $h_n \in H_n$ for each $n = 0,1,\dots$ such that $\pi_n(da \mid h_n)$ is concentrated on $A(\tilde{x}_n)$. Denote by $\Pi_{SMDP}$ the collection of all policies for the concerned SMDP. A policy $\pi = (\pi_n)_{n=0,1,\dots} \in \Pi_{SMDP}$ is called (randomized) Markov (resp., stationary) if each of the stochastic kernels $\pi_n$ reads, with a slight but conventional abuse of notation, $\pi_n(da \mid h_n) = \pi_n(da \mid \tilde{x}_n)$ (resp., $\pi_n(da \mid h_n) = \pi(da \mid \tilde{x}_n)$). A stationary policy is further called deterministic if $\pi(da \mid x) = \delta_{\varphi(x)}(da)$ for some measurable mapping $\varphi$ from $S_\infty$ to $A_\infty$ such that $\varphi(x_\infty) = a_\infty$ and $\varphi(x) \in A(x)$ for each $x \in S$.

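As in the construction of a CTMDP, let us introduce the following measurable mappings on $H_\infty$: $\tilde{t}_0 := 0$, $\tilde{t}_n := \tilde{t}_{n-1} + \tilde{\theta}_n$ and $\tilde{t}_\infty := \lim_{n\to\infty} \tilde{t}_n$. Given an initial distribution $\gamma$ on $S$ and a policy $\pi = (\pi_n)_{n=0,1,\dots}$, the Ionescu-Tulcea theorem yields the existence and uniqueness of a probability measure $\tilde{P}^\pi_\gamma$ on $(H_\infty, \mathcal{B}(H_\infty))$ such that $\tilde{P}^\pi_\gamma(\tilde{x}_0 \in dx) = \gamma(dx)$, $\tilde{P}^\pi_\gamma(\tilde{a}_n \in da \mid h_n) = \pi_n(da \mid h_n)$ and $\tilde{P}^\pi_\gamma(\tilde{x}_{n+1} \in dy, \tilde{\theta}_{n+1} \in dt \mid h_n, \tilde{a}_n) = Q(dy, dt \mid \tilde{x}_n, \tilde{a}_n)$. The corresponding expectation operator is denoted by $\tilde{E}^\pi_\gamma$, and when $\gamma(\{x\}) = 1$ for some $x \in S$ we write $\tilde{P}^\pi_x$ and $\tilde{E}^\pi_x$ for $\tilde{P}^\pi_\gamma$ and $\tilde{E}^\pi_\gamma$, respectively. Finally, the stochastic process on the probability space $(H_\infty, \mathcal{B}(H_\infty), \tilde{P}^\pi_\gamma)$ defined by the same formula as (1), with $t_n$, $t_\infty$ and $x_n$ replaced by $\tilde{t}_n$, $\tilde{t}_\infty$ and $\tilde{x}_n$, is called an SMDP.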

If $Q(S \times \{1\} \mid x,a) = 1$, that is, the sojourn times are degenerate and identically equal to one, then we write $Q(dy \times \{1\} \mid x,a) =: p(dy \mid x,a)$, and introduce the model $\{S, A, (A(x), x \in S), p(dy \mid x,a)\}$ of DTMDPs. For the construction of such an SMDP, we can exclude the (deterministic) sojourn times from consideration and simply take the sample space $H_\infty := S \times (A \times S)^\infty$, and define a policy $\pi = (\pi_n)_{n=0,1,\dots}$ as a sequence of stochastic kernels $\pi_n(da \mid h_n)$ with $h_n = (\hat{x}_0, \hat{a}_0, \hat{x}_1, \hat{a}_1, \dots, \hat{x}_n)$. The collection of such policies is denoted by $\Pi_{DTMDP}$. Markov and stationary policies for DTMDPs are defined in the same way as for SMDPs, and we denote by $\Pi^M_{DTMDP}$ the class of (randomized) Markov policies for DTMDPs. Under a policy $\pi \in \Pi_{DTMDP}$, as described above for the construction of the SMDP (see also, e.g., [50]), one can apply the Ionescu-Tulcea theorem to construct the probability and expectation on $H_\infty$, which are denoted by $\hat{P}^\pi_\gamma$ and $\hat{E}^\pi_\gamma$. The discrete-time stochastic process $\hat{x}_n$ under $\hat{P}^\pi_\gamma$ is called a DTMDP.

4.2 Equivalence between SMDPs and DTMDPs 

Let $\alpha(x) \ge 0$ be a measurable function on $S$. Quite formally we put $\alpha(x_\infty) := 0$.

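For the fixed initial distribution $\gamma$ on $S$, policy $\pi \in \Pi_{SMDP}$ and nonnegative integer $n = 0,1,2,\dots$, we define the occupancy measure on $\mathcal{B}(S \times A)$ for the SMDP by

\[
\tilde{M}^\pi_{\gamma,n}(dy, da) := \tilde{E}^\pi_\gamma\Big[ e^{-\sum_{i=0}^{n-1} \alpha(\tilde{x}_i)\tilde{\theta}_{i+1}} I\{\tilde{x}_n \in dy, \tilde{a}_n \in da\} \Big]. \tag{9}
\]

The marginal of $\tilde{M}^\pi_{\gamma,n}(dy,da)$ on $S$ is denoted by

\[
\tilde{m}^\pi_{\gamma,n}(dy) := \tilde{M}^\pi_{\gamma,n}(dy \times A). \tag{10}
\]

In particular, we have

\[
\tilde{m}^\pi_{\gamma,0}(dy) = \gamma(dy) \quad \text{for any } \pi \in \Pi_{SMDP}.
\]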



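Note that $\tilde{M}^\pi_{\gamma,n}(dy,da)$ and $\tilde{m}^\pi_{\gamma,n}(dy)$ are finite (indeed sub-stochastic) measures. Direct calculations based on the construction of the SMDP lead to the following relation between $\tilde{m}^\pi_{\gamma,n+1}$ and $\tilde{M}^\pi_{\gamma,n}$:

\begin{align*}
\tilde{m}^\pi_{\gamma,n+1}(dx)
&= \tilde{E}^\pi_\gamma\Big[ \tilde{E}^\pi_\gamma\Big[ e^{-\sum_{i=0}^{n} \alpha(\tilde{x}_i)\tilde{\theta}_{i+1}} I\{\tilde{x}_{n+1} \in dx\} \,\Big|\, h_n, \tilde{a}_n \Big] \Big] \\
&= \tilde{E}^\pi_\gamma\Big[ e^{-\sum_{i=0}^{n-1} \alpha(\tilde{x}_i)\tilde{\theta}_{i+1}} \tilde{E}^\pi_\gamma\Big[ e^{-\alpha(\tilde{x}_n)\tilde{\theta}_{n+1}} I\{\tilde{x}_{n+1} \in dx\} \,\Big|\, h_n, \tilde{a}_n \Big] \Big] \\
&= \tilde{E}^\pi_\gamma\Big[ e^{-\sum_{i=0}^{n-1} \alpha(\tilde{x}_i)\tilde{\theta}_{i+1}} \int_0^\infty e^{-\alpha(\tilde{x}_n)t} Q(dx, dt \mid \tilde{x}_n, \tilde{a}_n) \Big] \\
&= \int_K \Big( \int_0^\infty e^{-\alpha(y)t} Q(dx, dt \mid y, a) \Big) \tilde{M}^\pi_{\gamma,n}(dy, da) \tag{11}
\end{align*}

for any $n = 0,1,\dots$ and $\pi \in \Pi_{SMDP}$.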

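Furthermore, if $\pi = \sigma = (\sigma_n)_{n=0,1,\dots}$ is a Markov policy for the SMDP, we can relate $\tilde{M}^\sigma_{\gamma,n+1}$ to $\tilde{m}^\sigma_{\gamma,n+1}$ with

\begin{align*}
\tilde{M}^\sigma_{\gamma,n+1}(dx, da)
&= \tilde{E}^\sigma_\gamma\Big[ \tilde{E}^\sigma_\gamma\Big[ e^{-\sum_{i=0}^{n} \alpha(\tilde{x}_i)\tilde{\theta}_{i+1}} I\{\tilde{x}_{n+1} \in dx\} I\{\tilde{a}_{n+1} \in da\} \,\Big|\, h_{n+1} \Big] \Big] \\
&= \tilde{E}^\sigma_\gamma\Big[ e^{-\sum_{i=0}^{n} \alpha(\tilde{x}_i)\tilde{\theta}_{i+1}} I\{\tilde{x}_{n+1} \in dx\} \sigma_{n+1}(da \mid \tilde{x}_{n+1}) \Big] \\
&= \sigma_{n+1}(da \mid x)\, \tilde{m}^\sigma_{\gamma,n+1}(dx), \tag{12}
\end{align*}

where $n = -1, 0, 1, \dots$. Both (11) and (12) are repeatedly used below.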



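Lemma 4.1 For any policy $\pi \in \Pi_{SMDP}$ and initial distribution $\gamma$, there is a Markov policy $\sigma = (\sigma_n)_{n=0,1,\dots} \in \Pi_{SMDP}$ such that

\[
\tilde{M}^\pi_{\gamma,n}(\Gamma) = \tilde{M}^\sigma_{\gamma,n}(\Gamma) \quad \text{for each } \Gamma \in \mathcal{B}(S \times A) \text{ and } n = 0,1,\dots. \tag{13}
\]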
Proof. Let $n = 0,1,\dots$ be fixed. Since $\tilde{M}^\pi_{\gamma,n}(S \times A)$ is finite, Proposition D.8 in [50] gives the existence of a stochastic kernel $\sigma_n(da \mid x)$ from $S$ to $A$ concentrated on $A(x)$ for each $x \in S$, such that

\[
\tilde{M}^\pi_{\gamma,n}(dx, da) = \sigma_n(da \mid x)\, \tilde{m}^\pi_{\gamma,n}(dx), \tag{14}
\]

and it is the unique one in the $\tilde{m}^\pi_{\gamma,n}$-a.s. sense. Clearly, being produced in this way, the sequence $(\sigma_n)_{n=0,1,\dots}$ defines a Markov policy $\sigma$, for which we verify the equality to be proved by induction with respect to $n$ as follows. Since $\tilde{m}^\pi_{\gamma,0}(dx) = \tilde{m}^\sigma_{\gamma,0}(dx) = \gamma(dx)$ for every policy $\pi$, it follows from this and the definition of $\sigma_0$ that $\tilde{M}^\pi_{\gamma,0}(dx,da) = \tilde{M}^\sigma_{\gamma,0}(dx,da)$, i.e., the case of $n = 0$ is verified. Now suppose (13) holds for the case of $n = k$. It immediately follows from this inductive supposition and relation (11) that $\tilde{m}^\pi_{\gamma,k+1}(dx) = \tilde{m}^\sigma_{\gamma,k+1}(dx)$. This and (12) then imply $\tilde{M}^\sigma_{\gamma,k+1}(dx,da) = \sigma_{k+1}(da \mid x)\, \tilde{m}^\pi_{\gamma,k+1}(dx) = \tilde{M}^\pi_{\gamma,k+1}(dx,da)$, where the last equality is by the definition of the Markov policy $\sigma$. Thus, (13) holds for the case of $n = k+1$, and the statement of this lemma follows by induction. □
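
For finite state and action sets, the disintegration (14) used in the proof reduces to an elementary normalization. The following Python sketch is only an illustration under that finiteness assumption; the function name `disintegrate` and the dictionary representation of the occupancy measure are hypothetical and not part of the paper.

```python
def disintegrate(M):
    """Split a finite occupancy measure M[(x, a)] into its marginal m[x]
    and a conditional kernel sigma[x][a] with M[(x, a)] = sigma[x][a] * m[x]."""
    m = {}
    for (x, a), mass in M.items():
        m[x] = m.get(x, 0.0) + mass
    sigma = {}
    for (x, a), mass in M.items():
        if m[x] > 0.0:  # the kernel is only determined where the marginal is positive
            sigma.setdefault(x, {})[a] = mass / m[x]
    return m, sigma


# Illustrative sub-stochastic occupancy measure on two states and two actions.
M = {("s0", "a"): 0.3, ("s0", "b"): 0.1, ("s1", "a"): 0.2}
m, sigma = disintegrate(M)
```

States carrying zero marginal mass receive no kernel in this sketch, which mirrors the almost-sure uniqueness stated after (14).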

The next theorem establishes the equivalence between SMDPs and DTMDPs. 

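Theorem 4.1 Given the SMDP defined in Subsection 4.1, consider the following DTMDP with the state space $S_\infty$, action space $A_\infty$, admissible action spaces $A(\cdot)$, and the transition probability $p(dy \mid x,a)$ given by

\[
p(dy \mid x,a) := \int_0^\infty e^{-\alpha(x)t} Q(dy, dt \mid x,a), \quad p(\{x_\infty\} \mid x,a) := 1 - p(S \mid x,a), \quad p(\{x_\infty\} \mid x_\infty, a_\infty) := 1
\]

for each $dy \in \mathcal{B}(S)$, $x \in S_\infty$ and $a \in A_\infty$. Then

(a) for any policy $\pi \in \Pi_{SMDP}$, there is a Markov policy $\sigma \in \Pi_{DTMDP}$ such that
$\tilde{M}^\pi_{\gamma,n}(\Gamma_S \times \Gamma_A) = \hat{P}^\sigma_\gamma(\hat{x}_n \in \Gamma_S, \hat{a}_n \in \Gamma_A)$ for each $\Gamma_S \times \Gamma_A \in \mathcal{B}(S \times A)$, $n = 0,1,\dots$;

(b) for any policy $\pi \in \Pi_{DTMDP}$, there is a Markov policy $\sigma \in \Pi_{SMDP}$ such that
$\hat{P}^\pi_\gamma(\hat{x}_n \in \Gamma_S, \hat{a}_n \in \Gamma_A) = \tilde{M}^\sigma_{\gamma,n}(\Gamma_S \times \Gamma_A)$ for each $\Gamma_S \times \Gamma_A \in \mathcal{B}(S \times A)$, $n = 0,1,\dots$.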

Proof. Note that a policy $\sigma$ is Markov for the SMDP if and only if it is Markov for the DTMDP, and moreover, it follows from (11), (12) and the construction of the DTMDP that, restricted to $S \times A$, $\hat{P}^\sigma_\gamma(\hat{x}_n \in dx, \hat{a}_n \in da) = \tilde{M}^\sigma_{\gamma,n}(dx, da)$. Then, the statements of this theorem follow from Lemma 4.1 and the well-known fact that for any policy $\pi \in \Pi_{DTMDP}$, there exists a Markov policy $\sigma \in \Pi_{DTMDP}$ such that $\hat{P}^\pi_\gamma(\hat{x}_n \in dx, \hat{a}_n \in da) = \hat{P}^\sigma_\gamma(\hat{x}_n \in dx, \hat{a}_n \in da)$; see Lemma 2 in [35], for example. □
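
The transition probability of Theorem 4.1 can be checked numerically in a simple special case. The sketch below assumes, purely for illustration, finitely many post-jump states and an exponential kernel $Q(\{y\}, dt \mid x,a) = \lambda_y e^{-qt}\,dt$ with $q = \sum_y \lambda_y$; in that case $\int_0^\infty e^{-\alpha t} Q(\{y\}, dt \mid x,a) = \lambda_y/(\alpha + q)$. The function name `discounted_jump_prob`, the dictionary `rates` and the midpoint quadrature are hypothetical choices made for this example, not part of the paper.

```python
import math


def discounted_jump_prob(alpha, rates, steps=40_000, horizon=20.0):
    """Approximate p(y|x,a) = int_0^inf exp(-alpha*t) Q({y}, dt | x, a).

    Assumption (illustration only): Q({y}, dt | x, a) = rates[y] * exp(-q*t) dt
    with q = sum(rates.values()). The integral is truncated at `horizon`
    and evaluated with the midpoint rule.
    """
    q = sum(rates.values())
    dt = horizon / steps
    probs = {}
    for y, rate in rates.items():
        total = 0.0
        for k in range(steps):
            t = (k + 0.5) * dt
            total += math.exp(-(alpha + q) * t) * rate * dt
        probs[y] = total
    return probs


# Illustrative numbers: alpha(x) = 0.5 and two possible post-jump states.
p = discounted_jump_prob(0.5, {"y1": 1.0, "y2": 2.0})
to_x_inf = 1.0 - sum(p.values())  # mass sent to the isolated point x_inf
```

In this special case the mass sent to $x_\infty$ equals $\alpha/(\alpha+q)$, so a zero discount factor at $x$ sends no mass to $x_\infty$ whenever $q > 0$.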

4.3 Equivalence between CTMDPs and SMDPs under Condition 3.1

In this subsection, we always assume Condition 3.1 to be satisfied without explicit indication. Under this condition, we establish an equivalence between CTMDPs and SMDPs by showing that they have the same occupancy measures; see Theorem 4.2 below. We will also show in Subsection 5.3 below that the reduction of the CTMDPs to SMDPs in a weaker form is still possible if Condition 3.1 is replaced with the nonpositivity of the reward rates and the compactness of the action space.

Recall the agreements of x (ico) := and ^ := 0, which are used below without special reference. 

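For the CTMDP, given a policy $\pi \in \Pi_{CTMDP}$, we define for each $n = 0,1,\dots$ the occupancy measure on $\mathcal{B}(S \times A)$ by

\begin{align*}
M^\pi_{\gamma,n}(dy, da)
&:= E^\pi_\gamma\Big[ \int_{t_n}^{t_{n+1}} e^{-\int_0^t \alpha(\xi_s)\,ds} I\{x_n \in dy\}\, \pi(da \mid \omega, t)\, (q_{x_n}(a) + \alpha(x_n))\, dt \Big] \\
&= E^\pi_\gamma\Big[ \int_{t_n}^{t_{n+1}} e^{-\int_0^t \alpha(\xi_s)\,ds} I\{x_n \in dy\}\, \pi_n(da \mid x_0, \theta_1, \dots, x_n, t - t_n)\, (q_{x_n}(a) + \alpha(x_n))\, dt \Big], \tag{15}
\end{align*}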



(Footnote to Theorem 4.1(a): note that the restrictions of the measures $\hat{P}^\pi_\gamma(\hat{x}_n \in dx, \hat{a}_n \in da)$ and $\hat{P}^\sigma_\gamma(\hat{x}_n \in dx, \hat{a}_n \in da)$ to $S \times A$ can be sub-stochastic.)
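where the second equality follows from the predictability of the policy $\pi$. Below, the predictability of $\pi$ is used frequently, but for notational convenience we often prefer the more implicit but compact expression on the right-hand side of the first equality of (15) to the one in the second line of (15). Keeping this in mind, and due to the construction of the CTMDP (see Remark 2.1) and the relevant properties of conditional expectations, we obtain that

\begin{align*}
M^\pi_{\gamma,n}(S \times A)
&= E^\pi_\gamma\Big[ E^\pi_\gamma\Big[ e^{-\sum_{i=0}^{n-1} \alpha(x_i)\theta_{i+1}} \int_{t_n}^{t_{n+1}} e^{-\alpha(x_n)(t - t_n)} I\{x_n \in S\} \Big( \int_A q_{x_n}(a)\pi(da \mid \omega, t) + \alpha(x_n) \Big) dt \,\Big|\, x_0, \theta_1, \dots, x_n \Big] \Big] \\
&= E^\pi_\gamma\Big[ e^{-\sum_{i=0}^{n-1} \alpha(x_i)\theta_{i+1}} I\{x_n \in S\}\, E^\pi_\gamma\Big[ \int_0^{\theta_{n+1}} e^{-\alpha(x_n)t} \Big( \int_A q_{x_n}(a)\pi(da \mid \omega, t_n + t) + \alpha(x_n) \Big) dt \,\Big|\, x_0, \theta_1, \dots, x_n \Big] \Big] \\
&= E^\pi_\gamma\Big[ e^{-\sum_{i=0}^{n-1} \alpha(x_i)\theta_{i+1}} I\{x_n \in S\} \int_0^\infty e^{-\alpha(x_n)t} \Big( \int_A q_{x_n}(a)\pi(da \mid \omega, t_n + t) + \alpha(x_n) \Big) P^\pi_\gamma(\theta_{n+1} > t \mid x_0, \theta_1, \dots, x_n)\, dt \Big] \\
&= E^\pi_\gamma\Big[ e^{-\sum_{i=0}^{n-1} \alpha(x_i)\theta_{i+1}} I\{x_n \in S\} \int_0^\infty e^{-\alpha(x_n)t} \Big( \int_A q_{x_n}(a)\pi(da \mid \omega, t_n + t) + \alpha(x_n) \Big) e^{-\int_0^t \int_A q_{x_n}(a)\pi(da \mid \omega, t_n + s)\,ds}\, dt \Big] \\
&= E^\pi_\gamma\Big[ e^{-\sum_{i=0}^{n-1} \alpha(x_i)\theta_{i+1}} I\{x_n \in S\} \Big] \le 1, \tag{16}
\end{align*}

where the last equality in (16) is by Condition 3.1, which guarantees $e^{-\int_0^\infty ( \int_A q_{x_n}(a)\pi(da \mid \omega, t_n + t) + \alpha(x_n) )\,dt} = 0$ a.s. with respect to $P^\pi_\gamma$.

The marginal of $M^\pi_{\gamma,n}$ on $S$ is denoted by $m^\pi_{\gamma,n}(dx) := M^\pi_{\gamma,n}(dx \times A)$.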

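Lemma 4.2 Consider the SMDP with the following primitives: state space $S$, action space $A$, admissible action spaces $A(x)$ ($x \in S$), and

\[
Q(dy, [0,t] \mid x,a) := \frac{q(dy \setminus \{x\} \mid x,a)}{q_x(a)} \big( 1 - e^{-q_x(a)t} \big),
\]

where $S$, $A$, $A(x)$ and $q(dy \mid x,a)$ come from the primitives of the CTMDP. Then for each $\pi \in \Pi_{CTMDP}$, there exists a Markov policy $\sigma \in \Pi_{SMDP}$ such that

\[
M^\pi_{\gamma,n}(\Gamma) = \tilde{M}^\sigma_{\gamma,n}(\Gamma) \quad \text{for each } \Gamma \in \mathcal{B}(S \times A) \text{ and } n = 0,1,\dots. \tag{17}
\]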



Proof. Let $\pi \in \Pi_{CTMDP}$ be arbitrarily fixed. For any $n = 0,1,\dots$, according to what is explained before this statement, $M^\pi_{\gamma,n}$ is a finite measure on $\mathcal{B}(S \times A)$, and thus by Proposition D.8 in [50], there exists an $m^\pi_{\gamma,n}$-a.s. unique stochastic kernel $\sigma_n(da \mid x)$ from $S$ to $A$ concentrated on $A(x)$ for each $x \in S$ such that

\[
M^\pi_{\gamma,n}(dx, da) = m^\pi_{\gamma,n}(dx)\, \sigma_n(da \mid x). \tag{18}
\]

Clearly, $\sigma = (\sigma_n)_{n=0,1,\dots}$ defines a Markov policy for the SMDP described in the statement of this lemma. We now show that this policy $\sigma$ satisfies (17) for each $n = 0,1,\dots$ by induction as follows. In the case of $n = 0$, from (10) and (16) with $dx$ in lieu of $S$, we see

\[
m^\pi_{\gamma,0}(dx) = E^\pi_\gamma[I\{x_0 \in dx\}] = \gamma(dx) = \tilde{E}^\sigma_\gamma[I\{\tilde{x}_0 \in dx\}] = \tilde{m}^\sigma_{\gamma,0}(dx).
\]

Furthermore, it follows that

\[
M^\pi_{\gamma,0}(dx, da) = m^\pi_{\gamma,0}(dx)\, \sigma_0(da \mid x) = \tilde{m}^\sigma_{\gamma,0}(dx)\, \sigma_0(da \mid x) = \tilde{M}^\sigma_{\gamma,0}(dx, da), \tag{19}
\]

where the last equality is by the fact that $\sigma \in \Pi_{SMDP}$ is Markov and (12). Thus, (17) holds for the case of $n = 0$.
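Now suppose (17) holds for $n = k$, and consider the case of $n = k+1$. Then, on the one hand, from (11) and the definition of $Q(dy, dt \mid x,a)$ given in the statement of this lemma, we see

\begin{align*}
\tilde{m}^\sigma_{\gamma,k+1}(dx)
&= \int_K \Big( \int_0^\infty q(dx \setminus \{y\} \mid y,a)\, e^{-(\alpha(y) + q_y(a))t}\, dt \Big) \tilde{M}^\sigma_{\gamma,k}(dy, da) \\
&= \int_K \frac{q(dx \setminus \{y\} \mid y,a)}{\alpha(y) + q_y(a)}\, \tilde{M}^\sigma_{\gamma,k}(dy, da) \\
&= \int_K \frac{q(dx \setminus \{y\} \mid y,a)}{\alpha(y) + q_y(a)}\, M^\pi_{\gamma,k}(dy, da) \qquad \text{(by the inductive supposition)} \\
&= E^\pi_\gamma\Big[ \int_{t_k}^{t_{k+1}} e^{-\int_0^t \alpha(\xi_s)\,ds} \int_A q(dx \setminus \{x_k\} \mid x_k, a)\, \pi(da \mid \omega, t)\, dt \Big] \qquad \text{(by (15))} \\
&= E^\pi_\gamma\Big[ e^{-\sum_{i=0}^{k-1} \alpha(x_i)\theta_{i+1}}\, E^\pi_\gamma\Big[ \int_0^{\theta_{k+1}} e^{-\alpha(x_k)t} \int_A q(dx \setminus \{x_k\} \mid x_k, a)\, \pi(da \mid \omega, t_k + t)\, dt \,\Big|\, x_0, \theta_1, \dots, x_k \Big] \Big] \\
&= E^\pi_\gamma\Big[ e^{-\sum_{i=0}^{k-1} \alpha(x_i)\theta_{i+1}} \int_0^\infty e^{-\alpha(x_k)t} \int_A q(dx \setminus \{x_k\} \mid x_k, a)\, \pi(da \mid \omega, t_k + t)\, P^\pi_\gamma(\theta_{k+1} > t \mid x_0, \theta_1, \dots, x_k)\, dt \Big] \\
&= E^\pi_\gamma\Big[ e^{-\sum_{i=0}^{k-1} \alpha(x_i)\theta_{i+1}} \int_0^\infty e^{-\alpha(x_k)t} \int_A q(dx \setminus \{x_k\} \mid x_k, a)\, \pi(da \mid \omega, t_k + t)\, e^{-\int_0^t \int_A q_{x_k}(a)\pi(da \mid \omega, t_k + s)\,ds}\, dt \Big]. \tag{20}
\end{align*}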



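Let $dF_{\theta_{k+1} \mid x_0, \theta_1, \dots, x_k}(t)$ stand for the Lebesgue-Stieltjes measure generated by the (conditional) distribution function of $\theta_{k+1}$. Then, from (20) we have

\begin{align*}
\tilde{m}^\sigma_{\gamma,k+1}(dx)
&= E^\pi_\gamma\Big[ e^{-\sum_{i=0}^{k-1} \alpha(x_i)\theta_{i+1}} \int_0^\infty e^{-\alpha(x_k)t}\, \frac{\int_A q(dx \setminus \{x_k\} \mid x_k, a)\, \pi(da \mid \omega, t_k + t)}{\int_A q_{x_k}(a)\, \pi(da \mid \omega, t_k + t)}\, dF_{\theta_{k+1} \mid x_0, \theta_1, \dots, x_k}(t) \Big] \\
&= E^\pi_\gamma\Big[ e^{-\sum_{i=0}^{k-1} \alpha(x_i)\theta_{i+1}}\, E^\pi_\gamma\Big[ e^{-\alpha(x_k)\theta_{k+1}}\, \frac{\int_A q(dx \setminus \{x_k\} \mid x_k, a)\, \pi(da \mid \omega, t_k + \theta_{k+1})}{\int_A q_{x_k}(a)\, \pi(da \mid \omega, t_k + \theta_{k+1})} \,\Big|\, x_0, \theta_1, \dots, x_k \Big] \Big] \\
&= E^\pi_\gamma\Big[ e^{-\sum_{i=0}^{k} \alpha(x_i)\theta_{i+1}}\, \frac{\int_A q(dx \setminus \{x_k\} \mid x_k, a)\, \pi(da \mid \omega, t_{k+1})}{\int_A q_{x_k}(a)\, \pi(da \mid \omega, t_{k+1})} \Big]. \tag{21}
\end{align*}

On the other hand, similarly to the derivation of (16) with $S$ being replaced by $dx$, we obtain

\begin{align*}
m^\pi_{\gamma,k+1}(dx)
&= E^\pi_\gamma\Big[ E^\pi_\gamma\Big[ e^{-\sum_{i=0}^{k} \alpha(x_i)\theta_{i+1}} I\{x_{k+1} \in dx\} \,\Big|\, x_0, \theta_1, \dots, x_k, \theta_{k+1} \Big] \Big] \\
&= E^\pi_\gamma\Big[ e^{-\sum_{i=0}^{k} \alpha(x_i)\theta_{i+1}}\, \frac{\int_A q(dx \setminus \{x_k\} \mid x_k, a)\, \pi(da \mid \omega, t_{k+1})}{\int_A q_{x_k}(a)\, \pi(da \mid \omega, t_{k+1})} \Big], \tag{22}
\end{align*}

with the last equality following from the construction of the CTMDP. Comparing this with (21), we see

\[
m^\pi_{\gamma,k+1}(dx) = \tilde{m}^\sigma_{\gamma,k+1}(dx).
\]

It follows from this and a reasoning similar to that of (19), with $0$ being replaced by $k+1$, that

\[
M^\pi_{\gamma,k+1}(dx, da) = \tilde{M}^\sigma_{\gamma,k+1}(dx, da).
\]

Now the statement to be proved follows by mathematical induction. □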



(Footnote to (21): recall the convention $0/0 := 0$ for the validity of the three equalities in (21).)
Lemma 4.3 Consider the SMDP as defined in the statement of Lemma 4.2 using the primitives of the CTMDP. For any policy $\pi \in \Pi_{SMDP}$, let $\sigma = (\sigma_n)_{n=0,1,\dots} \in \Pi_{SMDP}$ be the Markov policy corresponding to $\pi \in \Pi_{SMDP}$ as in the statement of Lemma 4.1. Define a policy $\bar{\pi} = (\bar{\pi}_n) \in \Pi_{CTMDP}$ with $\bar{\pi}_n$ given by

\[
\bar{\pi}_n(da \mid x_n, t - t_n) := \frac{e^{-q_{x_n}(a)(t - t_n)}\, \sigma_n(da \mid x_n)}{\int_A e^{-q_{x_n}(b)(t - t_n)}\, \sigma_n(db \mid x_n)}.
\]

Then

\[
\tilde{M}^\pi_{\gamma,n}(\Gamma) = \tilde{M}^\sigma_{\gamma,n}(\Gamma) = M^{\bar{\pi}}_{\gamma,n}(\Gamma) \quad \text{for each } \Gamma \in \mathcal{B}(S \times A) \text{ and } n = 0,1,\dots.
\]



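Proof. Throughout this proof, $\gamma$ on $S$ is the common initial distribution for the SMDP and CTMDP. We follow the argument as in the proof of Theorem 8.4 of [S], which is exclusively about standard discounted CTMDPs. Note that under the fixed Markov policy $\sigma = (\sigma_n)_{n=0,1,\dots} \in \Pi_{SMDP}$, one can uniquely define the marked point process $\{\tilde{t}_n, \tilde{x}_n\}$ and thus the SMDP by specifying the dual predictable projection, say

\[
\tilde{\nu}(dt, dy) := \sum_{n \ge 0} \frac{\int_A q(dy \setminus \{\tilde{x}_n\} \mid \tilde{x}_n, a)\, e^{-q_{\tilde{x}_n}(a)(t - \tilde{t}_n)}\, \sigma_n(da \mid \tilde{x}_n)}{\int_A e^{-q_{\tilde{x}_n}(a)(t - \tilde{t}_n)}\, \sigma_n(da \mid \tilde{x}_n)}\, I\{\tilde{t}_n < t \le \tilde{t}_{n+1}\}\, dt \tag{23}
\]

in accordance with Theorem 4.21 of [53]. The corresponding probability measure is still denoted as $\tilde{P}^\sigma_\gamma(\cdot)$. We also note that under the policy $\bar{\pi} \in \Pi_{CTMDP}$ as defined in the statement of this lemma, the dual predictable projection defining $\{t_n, x_n\}$ and thus the CTMDP coincides with the one given by (23). Thus

\[
P^{\bar{\pi}}_\gamma = \tilde{P}^\sigma_\gamma. \tag{24}
\]

For this reason, below in this proof we simply write $\{\tilde{t}_n, \tilde{x}_n\}$ as $\{t_n, x_n\}$. Now let $n = 0,1,\dots$, $\Gamma_S \in \mathcal{B}(S)$ and $\Gamma_A \in \mathcal{B}(A)$ be arbitrarily fixed. We show

\[
M^{\bar{\pi}}_{\gamma,n}(\Gamma_S \times \Gamma_A) = \tilde{M}^\sigma_{\gamma,n}(\Gamma_S \times \Gamma_A)
\]

as follows. Indeed, we have

\begin{align*}
M^{\bar{\pi}}_{\gamma,n}(\Gamma_S \times \Gamma_A)
&= E^{\bar{\pi}}_\gamma\Big[ \int_{t_n}^{t_{n+1}} e^{-\int_0^t \alpha(\xi_s)\,ds} \int_A (q_{x_n}(a) + \alpha(x_n))\, I\{x_n \in \Gamma_S, a \in \Gamma_A\}\, \bar{\pi}_n(da \mid x_n, t - t_n)\, dt \Big] \\
&= E^{\bar{\pi}}_\gamma\Big[ e^{-\sum_{i=0}^{n-1} \alpha(x_i)\theta_{i+1}}\, E^{\bar{\pi}}_\gamma\Big[ \int_0^{\theta_{n+1}} e^{-\alpha(x_n)t} \int_A I\{x_n \in \Gamma_S, a \in \Gamma_A\}\, (q_{x_n}(a) + \alpha(x_n))\, \bar{\pi}_n(da \mid x_n, t)\, dt \,\Big|\, x_0, \theta_1, \dots, x_n \Big] \Big] \\
&= E^{\bar{\pi}}_\gamma\Big[ e^{-\sum_{i=0}^{n-1} \alpha(x_i)\theta_{i+1}} \int_0^\infty e^{-\alpha(x_n)t} \int_A (q_{x_n}(a) + \alpha(x_n))\, I\{x_n \in \Gamma_S, a \in \Gamma_A\}\, \bar{\pi}_n(da \mid x_n, t)\, P^{\bar{\pi}}_\gamma(\theta_{n+1} > t \mid x_0, \theta_1, \dots, x_n)\, dt \Big] \\
&= \tilde{E}^\sigma_\gamma\Big[ e^{-\sum_{i=0}^{n-1} \alpha(x_i)\theta_{i+1}} \int_0^\infty e^{-\alpha(x_n)t} \int_A (q_{x_n}(a) + \alpha(x_n))\, I\{x_n \in \Gamma_S, a \in \Gamma_A\}\, \frac{e^{-q_{x_n}(a)t}\, \sigma_n(da \mid x_n)}{\int_A e^{-q_{x_n}(b)t}\, \sigma_n(db \mid x_n)}\, \tilde{P}^\sigma_\gamma(\theta_{n+1} > t \mid x_0, \theta_1, \dots, x_n)\, dt \Big],
\end{align*}

where the first equality is due to (15), and the last equality is by (24) and the definition of the policy $\bar{\pi}$ as in the statement of this lemma. Since

\[
\tilde{P}^\sigma_\gamma(\theta_{n+1} > t \mid x_0, \theta_1, \dots, x_n) = \int_A e^{-q_{x_n}(a)t}\, \sigma_n(da \mid x_n),
\]

the previous calculations further lead to

\begin{align*}
M^{\bar{\pi}}_{\gamma,n}(\Gamma_S \times \Gamma_A)
&= \tilde{E}^\sigma_\gamma\Big[ e^{-\sum_{i=0}^{n-1} \alpha(x_i)\theta_{i+1}} \int_0^\infty e^{-\alpha(x_n)t} \int_A (q_{x_n}(a) + \alpha(x_n))\, I\{x_n \in \Gamma_S, a \in \Gamma_A\}\, e^{-q_{x_n}(a)t}\, \sigma_n(da \mid x_n)\, dt \Big] \\
&= \tilde{E}^\sigma_\gamma\Big[ e^{-\sum_{i=0}^{n-1} \alpha(x_i)\theta_{i+1}} \int_A I\{x_n \in \Gamma_S, a \in \Gamma_A\}\, \sigma_n(da \mid x_n) \Big] \\
&= \tilde{M}^\sigma_{\gamma,n}(\Gamma_S \times \Gamma_A),
\end{align*}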



where the second to the last equality is by (P^ , and the last equality is according to the construction of the 
SMDP and ©. The proof of this lemma is thus completed. □ 
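
The policy constructed in Lemma 4.3 reweights the Markov kernel $\sigma_n$ by the survival factors $e^{-q_{x_n}(a)s}$, where $s = t - t_n$ is the time elapsed since the last jump. The sketch below illustrates this for a finite action set (an assumption made only for the example) and checks numerically that the sojourn time under the reweighted CTMDP policy has the same survival function as the mixture of exponentials induced by $\sigma_n$ in the SMDP. All function names are hypothetical.

```python
import math


def ctmdp_action_weights(sigma, q, s):
    """Weights proportional to exp(-q[a] * s) * sigma[a], with s = t - t_n."""
    w = {a: math.exp(-q[a] * s) * p for a, p in sigma.items()}
    z = sum(w.values())
    return {a: v / z for a, v in w.items()}


def hazard(sigma, q, s):
    """Jump intensity of the CTMDP at elapsed time s under the reweighted policy."""
    weights = ctmdp_action_weights(sigma, q, s)
    return sum(q[a] * weights[a] for a in weights)


def smdp_survival(sigma, q, t):
    """P(theta > t) in the SMDP: a sigma-mixture of exponential sojourn times."""
    return sum(p * math.exp(-q[a] * t) for a, p in sigma.items())


def ctmdp_survival(sigma, q, t, steps=20_000):
    """P(theta > t) in the CTMDP: exp(-integral of the hazard), midpoint rule."""
    dt = t / steps
    integral = sum(hazard(sigma, q, (k + 0.5) * dt) for k in range(steps)) * dt
    return math.exp(-integral)


# Illustrative numbers: a slow and a fast action mixed by sigma.
sigma = {"slow": 0.4, "fast": 0.6}
q = {"slow": 0.5, "fast": 3.0}
gap = abs(ctmdp_survival(sigma, q, 2.0) - smdp_survival(sigma, q, 2.0))
```

As the elapsed time grows, the weights shift towards the actions with smaller jump rates, which is what one expects after conditioning on the event that no jump has occurred yet.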

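Theorem 4.2 Consider the DTMDP with the state space $S_\infty = S \cup \{x_\infty\}$, action space $A_\infty = A \cup \{a_\infty\}$, admissible action spaces $A(x)$ ($x \in S_\infty$), and the transition probability given by $p(dy \mid x,a) = \frac{q(dy \setminus \{x\} \mid x,a)}{\alpha(x) + q_x(a)}$ for each $dy \in \mathcal{B}(S)$ and $x \in S$, $a \in A(x)$, $p(\{x_\infty\} \mid x,a) = 1 - p(S \mid x,a)$ for each $x \in S$, $a \in A(x)$, and $p(\{x_\infty\} \mid x_\infty, a_\infty) = 1$. Then for any $\pi \in \Pi_{CTMDP}$ (resp., $\sigma \in \Pi_{DTMDP}$), there exists some $\sigma \in \Pi_{DTMDP}$ (resp., $\pi \in \Pi_{CTMDP}$) such that

\[
W(\gamma, \pi, r_i) = \hat{E}^\sigma_\gamma\Big[ \sum_{n=0}^\infty \frac{r_i(\hat{x}_n, \hat{a}_n)}{\alpha(\hat{x}_n) + q_{\hat{x}_n}(\hat{a}_n)} \Big]
\]

for each $i = 0,1,\dots,N$.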
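Proof. Since

\[
W(\gamma, \pi, r_i) = \sum_{n=0}^\infty \int_K \frac{r_i(x,a)}{\alpha(x) + q_x(a)}\, M^\pi_{\gamma,n}(dx, da)
\]

for each $\pi \in \Pi_{CTMDP}$ under Condition 3.1 (by (15) and the definition of $W$), and also

\[
\hat{E}^\pi_\gamma\Big[ \sum_{n=0}^\infty \frac{r_i(\hat{x}_n, \hat{a}_n)}{\alpha(\hat{x}_n) + q_{\hat{x}_n}(\hat{a}_n)} \Big] = \sum_{n=0}^\infty \int_K \frac{r_i(x,a)}{\alpha(x) + q_x(a)}\, \hat{P}^\pi_\gamma(\hat{x}_n \in dx, \hat{a}_n \in da) \quad \forall\, \pi \in \Pi_{DTMDP}, \tag{25}
\]

the statements in this theorem follow from Theorem 4.1 and Lemmas 4.2 and 4.3, as well as the well-known fact that Markov policies are sufficient for DTMDPs (e.g., Lemma 2 in [25]). □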
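
The identity in Theorem 4.2 can be illustrated on a finite state space under a fixed deterministic stationary policy. The sketch below assumes, as a standard fact for finite chains that is not taken from the paper, that the CTMDP value solves the linear system $(\mathrm{diag}(\alpha) - Q)W = r$, and compares it with the total reward of the DTMDP with $p(y \mid x) = q(y \mid x)/(\alpha(x) + q_x)$ and one-step reward $r(x)/(\alpha(x) + q_x)$. The function names and the numbers are hypothetical; note that the discount factor is zero in the first state.

```python
def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting for small dense systems."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ctmdp_value(alpha, Q, r):
    """Solve (diag(alpha) - Q) W = r, with Q the generator under the fixed policy."""
    n = len(r)
    A = [[(alpha[i] if i == j else 0.0) - Q[i][j] for j in range(n)] for i in range(n)]
    return solve_linear(A, r)


def dtmdp_value(alpha, Q, r, sweeps=500):
    """Iterate V = R + P V, where P(x,y) = Q[x][y] / (alpha[x] + q_x) for y != x
    and R(x) = r[x] / (alpha[x] + q_x), with q_x = -Q[x][x]."""
    n = len(r)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [
            (r[i] + sum(Q[i][j] * V[j] for j in range(n) if j != i)) / (alpha[i] - Q[i][i])
            for i in range(n)
        ]
    return V


# Two states; alpha vanishes at state 0 and equals 1 at state 1 (illustrative).
alpha = [0.0, 1.0]
Q = [[-2.0, 2.0], [1.0, -1.0]]
r = [-1.0, -3.0]
W = ctmdp_value(alpha, Q, r)
V = dtmdp_value(alpha, Q, r)
```

Multiplying the DTMDP fixed-point equation by $\alpha(x) + q_x$ recovers the linear system for $W$, which is why the two computations agree in this finite setting.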

5 Proofs of the main statements 

5.1 Solvability results for DTMDPs 

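In this subsection, we consider a DTMDP defined using the primitives of the CTMDP. Let the state space of the DTMDP be $S_\infty = S \cup \{x_\infty\}$, the action space be $A_\infty = A \cup \{a_\infty\}$, the admissible action spaces be $A(x)$ ($x \in S_\infty$), and the transition probability be given by $p(dy \mid x,a) = \frac{q(dy \setminus \{x\} \mid x,a)}{\alpha(x) + q_x(a)}$ for each $dy \in \mathcal{B}(S)$ and $x \in S$, $a \in A(x)$, $p(\{x_\infty\} \mid x,a) = 1 - p(S \mid x,a)$ for each $x \in S$, $a \in A(x)$, and $p(\{x_\infty\} \mid x_\infty, a_\infty) = 1$. Let the possibly infinite-valued reward functions for the DTMDP be $R_i(x,a) := \frac{r_i(x,a)}{\alpha(x) + q_x(a)}$ for $a \in A(x)$, $x \in S$, and $R_i(x_\infty, a_\infty) := 0$, with $i = 0,1,\dots,N$. Denote by $\hat{x}_n$ and $\hat{a}_n$ the controlled and controlling processes.

For this DTMDP, we consider the following unconstrained problem

\[
\hat{E}^\pi_\gamma\Big[ \sum_{n=0}^\infty R_0(\hat{x}_n, \hat{a}_n) \Big] \to \max_{\pi \in \Pi_{DTMDP}}
\]

and constrained problem

\[
\hat{E}^\pi_\gamma\Big[ \sum_{n=0}^\infty R_0(\hat{x}_n, \hat{a}_n) \Big] \to \max_{\pi \in \Pi_{DTMDP}} \quad \text{s.t.} \quad \hat{E}^\pi_\gamma\Big[ \sum_{n=0}^\infty R_j(\hat{x}_n, \hat{a}_n) \Big] \ge d_j, \quad j = 1,\dots,N,
\]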
with the initial distribution $\gamma$ the same as the one for the CTMDP. The concepts of optimal and constrained-optimal policies for the DTMDP problems are understood similarly to those for the CTMDP problems defined earlier. For readability and convenient future reference, we collect some known facts about the solvability of the aforementioned two DTMDP problems in the next proposition, whose parts (a,b) follow from Theorems 15.2 and 16.2 of [33], and whose part (c) is due to Theorem 4.1 of [5]. We point out that a careful check reveals that Theorem 4.1 in [5] remains valid if the reward functions can take the value $-\infty$.

Proposition 5.1.

(a) Consider the unconstrained DTMDP problem. If the multifunction $x \mapsto A(x)$ is compact-valued and separable, $\int_{S_\infty} f(y)\, p(dy \mid x,a)$ is continuous in $a \in A(x)$ for each $x \in S$ and each bounded measurable function $f$ on $S_\infty$, and $R_0(x,a) \le 0$ for each $a \in A(x)$, $x \in S$, and is upper semicontinuous in $a \in A(x)$ for each $x \in S$, then there is a deterministic stationary optimal policy.

(b) Consider the unconstrained DTMDP problem. If $A(x) = A$ for each $x \in S$, $A$ is compact, $\int_{S_\infty} f(y)\, p(dy \mid x,a)$ is continuous in $(x,a)$ for each bounded continuous function $f$ on $S_\infty$, and $R_0(x,a) \le 0$ for each $x \in S$, $a \in A$, and is upper semicontinuous in $(x,a)$, then there is a deterministic stationary optimal policy.

(c) Consider the constrained DTMDP problem. Suppose that $A(x) = A$ for each $x \in S$, $A$ is compact, $\int_{S_\infty} f(y)\, p(dy \mid x,a)$ is continuous in $(x,a)$ for each bounded continuous function $f$ on $S_\infty$, and for each $i = 0,1,\dots,N$, $R_i(x,a) \le 0$ for each $x \in S$, $a \in A$, and is upper semicontinuous in $(x,a)$. If there exists some feasible policy $\pi \in \Pi_{DTMDP}$ such that $\hat{E}^\pi_\gamma\big[ \sum_{n=0}^\infty R_0(\hat{x}_n, \hat{a}_n) \big] > -\infty$, then there exists a randomized stationary optimal policy.
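
For a finite model with nonpositive rewards, the existence of a deterministic stationary optimal policy asserted in parts (a,b) can be observed by a simple computation. The following sketch uses value iteration started from zero, which is a standard device and not the method of the cited theorems; the model, the names `value_iteration`, `R` and `P`, and the numbers are hypothetical. Transition rows may sum to less than one, the missing mass being sent to the absorbing point $x_\infty$ with zero reward.

```python
def value_iteration(states, actions, R, P, sweeps=200):
    """Iterate V <- max_a [ R[x][a] + sum_y P[x][a][y] * V[y] ] starting from V = 0.

    P[x][a] is a dict over states in S; its deficit 1 - sum(P[x][a].values())
    is the probability of moving to the absorbing point x_inf (reward zero).
    """
    def q_value(V, x, a):
        return R[x][a] + sum(p * V[y] for y, p in P[x][a].items())

    V = {x: 0.0 for x in states}
    for _ in range(sweeps):
        V = {x: max(q_value(V, x, a) for a in actions[x]) for x in states}
    policy = {x: max(actions[x], key=lambda a: q_value(V, x, a)) for x in states}
    return V, policy


# Illustrative model: in state 0 one may stop (jump to x_inf) or continue to state 1.
states = [0, 1]
actions = {0: ["stop", "go"], 1: ["go"]}
R = {0: {"stop": -1.0, "go": -0.2}, 1: {"go": -0.5}}
P = {0: {"stop": {}, "go": {1: 1.0}}, 1: {"go": {0: 0.5}}}
V, policy = value_iteration(states, actions, R, P)
```

In this example the cheaper one-step reward of "go" is outweighed by the cost accumulated afterwards, so the greedy policy stops in state 0.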



5.2 Proofs of Theorems 3.1, 3.2 and 3.3 under Condition 3.1

Before starting the proofs, we note the following fact. 

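Proposition 5.2 Consider the DTMDP given in Theorem 4.2 under Condition 3.1. Then, it holds that

\[
\sup_{\pi \in \Pi_{CTMDP},\ \pi \text{ feasible}} W(\gamma, \pi, r_0) = \sup_{\pi \in \Pi^F_{DTMDP}} \hat{E}^\pi_\gamma\Big[ \sum_{n=0}^\infty \frac{r_0(\hat{x}_n, \hat{a}_n)}{\alpha(\hat{x}_n) + q_{\hat{x}_n}(\hat{a}_n)} \Big], \tag{26}
\]
\[
\sup_{\pi \in \Pi_{CTMDP}} W(\gamma, \pi, r_0) = \sup_{\pi \in \Pi_{DTMDP}} \hat{E}^\pi_\gamma\Big[ \sum_{n=0}^\infty \frac{r_0(\hat{x}_n, \hat{a}_n)}{\alpha(\hat{x}_n) + q_{\hat{x}_n}(\hat{a}_n)} \Big], \tag{27}
\]

where $\Pi^F_{DTMDP} := \Big\{ \pi \in \Pi_{DTMDP} : \hat{E}^\pi_\gamma\Big[ \sum_{n=0}^\infty \frac{r_j(\hat{x}_n, \hat{a}_n)}{\alpha(\hat{x}_n) + q_{\hat{x}_n}(\hat{a}_n)} \Big] \ge d_j,\ j = 1,2,\dots,N \Big\}$.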



Proof. This follows immediately from Theorem 4.2. □

Condition 3.1, as assumed throughout this subsection, is essential to the above proposition, as demonstrated by the examples in the next section.



Remark 5.1 The statement of Proposition 5.2 is still valid if we consider the DTMDP with the state space $S_\infty$, action space $A$, admissible action spaces $A(x)$ ($x \in S_\infty$) with $A(x_\infty) := A$, and the transition probability defined by $p(dy \mid x,a) = \frac{q(dy \setminus \{x\} \mid x,a)}{\alpha(x) + q_x(a)}$ for each $dy \in \mathcal{B}(S)$ and $x \in S$, $a \in A(x)$, $p(\{x_\infty\} \mid x,a) = 1 - p(S \mid x,a)$ for each $x \in S$, $a \in A$, and $p(\{x_\infty\} \mid x_\infty, a) := 1$, given that $r_i(x_\infty, a) := 0$ for any $a \in A$.

Having established the above statements, we are now ready to give the proofs of some of the main results in this article, whereas the proofs of the other statements, which do not require Condition 3.1, are postponed to the next subsection.



Proof of Theorem 13. Jl Keeping in mind Proposition 15. 2[ part (a) of this statement follows from Theorem 
3.1(B) of |7]; part (b) is a result of Chapter 5, Section 2 of [5]; whereas parts (c) and (d) arc by Propositions 
9.10 and 9.14 in |5]. As for part (e), we note that 



^^ Wix,7:,\ro\) ^^^ El [j^ e'^* j^ M^wi (6, a)7r(da|c^, t)dt\ ^ M2{b^ + /3) 
xes wi{x) ~ xes wi{x) ~ /3{(3 - a) 



(28) 
for each $x \in S$ and $\pi \in \Pi_{CTMDP}$, where the first inequality is by Condition 3.2, and the second inequality is
according to Proposition 2.f of ^7\. Thus, Condition 13.21 is stronger than the conchtion in part (b). Now we 
observe that under Condition [3?2j the strong continuity condition (S) in [34] is satisfied by the DTMDP defined 
in Proposition 15.21 Thus, by Proposition [521 of this paper. Theorem in [34] and the observation foUowing from 
part (c) that 



Wix,TT,ro) ^ El 



E 

.k=Q 



?-o(Xn,an) 

a{xn) + gi„(a„) 



(29) 



under each deterministic stationary pohcy tt G ^ctmdp = ^dtmdp > where ^ctmdp ^^'^ ^dtmdp stand for 
the class of deterministic stationary policies for the CTMDP and the DTMDP, we see 



W{x) 



sup 



W{x,Tr,ro) 



neni: 



sup 



W{x,TT,ro) 



for each x G S, where ^ctmdp denotes the class of randomized stationary policies for the CTMDP. Therefore, 
for problem © we can be restricted to the class of randomized stationary policies as in [3H1 . It remains to refer 
to Theorem 3.3 of [53] and Section 4 therein for the statement. Thus, the proof of this theorem is completed. 
D 

Proof of Theorem [371 Since ^ holds for any policy n e ^Stmdp == ^dtmdp with the DTMDP being 
defined in the statement of Proposition [521 part (a) of the statement follows from Proposition 15.21 of this paper 
and Theorem 15.2 of [35] (see Proposition 15. If a)), while part (b) follows from the proof of part (e) of Theorem 
I3.1l in this paper and Theorem 3.3 in [5^. D 



Proof of Theorem 3.3. We prove part (a) first. Under the conditions of the statement, the DTMDP defined in Remark 5.1 satisfies Assumptions 2.1 and 4.1 in [5]. Consequently, we refer to Theorem 4.1 of [5] for the existence of a randomized stationary policy $\pi^* \in \Pi_{DTMDP}$ such that



E" 



E 

.k=0 



ro{xn,a„ 



a{xn) + gx„(a„) 



sup 



E" 



Tren 



DTMDP 



E 

.fe=0 



ro{Xn:an) 



Now define a randomized stationary policy $\phi^*$ for the CTMDP by

\[
\phi^*(da \mid x) := \frac{\pi^*(da \mid x) / (\alpha(x) + q_x(a))}{\int_A \frac{1}{\alpha(x) + q_x(b)}\, \pi^*(db \mid x)} \tag{30}
\]

for each $x \in S$. We claim that the policy $\phi^*$ is constrained-optimal for problem (5). Indeed, it is well known,



see e.g., [5], that E^ ^J2Zo c.{Z)+qZia„ 

r~{x,a) 



is the minimal nonnegative solution to the equation 



Vix) 



TT {da\x) + / V{y) / — --— — -— TT {da\x), x£ S (31) 

jia{x)+qx(a) Jg J a a{x) + q^ia) 



for each i = 0, 1, . . . , A^. On the other hand, according to part (c) of Theorem 13. 11 W{x, 
nonnegative solution to the equation 



is the minimal 



V{x) = 



Ia^j {x,a)(t)*{da\x) 
/^(a(x) + qx{a))(f)*{da\x) 



s /a ("(2-') + qx{a))(t>*{da\x) 



rj {x,a) ,/ , I X , f .rf ^ f q{dy\{x}\x,a) ,, , , . ^ „ 

---^ -— TT (dalxj + / V(y) / ^—^ r^"" (da\x), x £ S, 

jia{x)+qx{a) Jg ^ 'Ja a(x)+g,(a) 



(32) 



where the second equality is by the definition of 0*, see f|3Q)) . By comparing (|3T]) with ([32l) . we see 



W(x,(^*,r: 



E-^ 



E 

.fc=0 



a{Xn) +qS:r.i.O,n) 
and thus



Wi^,r,n)^E._ 






E' 



for each $i = 0,1,\dots,N$. It follows from this, the definition of $\pi^*$, and Proposition 5.2 that $\phi^*$ is a randomized stationary constrained-optimal policy for problem (5).
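
The reweighting (30) can be made concrete for a finite action set. The sketch below, with hypothetical function names and illustrative numbers, converts a randomized stationary DTMDP policy at a single state into the corresponding CTMDP policy and checks the one-step identity behind the comparison of the two fixed-point equations: the ratio of the $\phi^*$-averaged reward rate to the $\phi^*$-averaged value of $\alpha(x) + q_x(a)$ equals the $\pi^*$-average of $r(x,a)/(\alpha(x) + q_x(a))$.

```python
def to_ctmdp_policy(pi_star, alpha_x, q_x):
    """phi*(a|x) proportional to pi*(a|x) / (alpha(x) + q_x(a)), as in (30)."""
    w = {a: p / (alpha_x + q_x[a]) for a, p in pi_star.items()}
    z = sum(w.values())
    return {a: v / z for a, v in w.items()}


def ctmdp_one_step(phi, alpha_x, q_x, r_x):
    """Averaged reward rate divided by averaged (alpha + q) under phi."""
    num = sum(phi[a] * r_x[a] for a in phi)
    den = sum(phi[a] * (alpha_x + q_x[a]) for a in phi)
    return num / den


def dtmdp_one_step(pi_star, alpha_x, q_x, r_x):
    """pi*-average of the DTMDP one-step reward r / (alpha + q)."""
    return sum(p * r_x[a] / (alpha_x + q_x[a]) for a, p in pi_star.items())


# Illustrative data at one state x.
pi_star = {"a": 0.5, "b": 0.5}
alpha_x = 1.0
q_x = {"a": 1.0, "b": 3.0}
r_x = {"a": -2.0, "b": -4.0}
phi_star = to_ctmdp_policy(pi_star, alpha_x, q_x)
```

The same computation with the transition rates $q(dy \setminus \{x\} \mid x,a)$ in place of $r(x,a)$ shows that the transition terms of the two equations also coincide.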

Now we turn to part (b). We first define an auxiliary CTMDP with the state space $S$, action space $A$, admissible action spaces $A(x)$ ($x \in S$), and transition rates $q(dy \mid x,a)/\alpha(x)$ (recalling that $\alpha(x) > 0$ under Condition 3.2(f)). It can be easily verified that the corresponding versions of the conditions of this statement are satisfied by this auxiliary CTMDP with the constant discount factor 1 and the reward rate



I{x e S,a e A}{a{x) + q^{a)) 

rlx, a) = ^-^ > 0. 

a{x) 



(33) 



Thus, it follows from the proof of Proposition 2.1 in [57], see also [55], that 

/{Ctg.S.agA}(a(6)+gtt(a)) 



sup^encTA/DP ^x 



sup ■ 



/„°° e-* /^ iiSi5£:iiH^iv^S£miii^^(da|c., t)dt 



w{x) 






where EZ denotes the expectation for the auxiliary CTMDP under the policy tt with the inital distribution 7. 
By Theorem 3.1 in [30| or Theorem 13. II in the present paper, 



Wix) := sup El 

ttGIIctmup 



e * / ^te^^.«^^}M6) + g,.W)^(,^|^^,),, 



/o JA o:{x) 

is the minimal nonnegative upper seminanlytic (indeed measurable) solution to 

r'{x,a) 1 



W{x) 



sup 



a&A{x} [ aix) + qx{a) a{x) + q^ia) J s\{x} 



W.iy)'l{dy\x,a) } , X € S, 



where 



r'{x, a) := I{x E S,a E A}{qx{a) + a{x)). 
Comparing this with ([7]), we see from Theorem 13. II in the present article that 

sup W{x,'K,r') = W{x) \fxeS. 

TrGUcTAinp 

It follows from this and fl34ll that 



sup Wij, TT, r') < f Wix)-f{dx) < ^^^ + 4^^ I l{dx)w{x) < ^ 
TveUcTMDP Js P\P~c) P{P~c)Js 



where we recall the condition of J„j{dx)'w{x) < cxd for the last inequality. 

Consider now the DTMDP deiincd in Proposition [221 According to Lemmas 
have that for any policy tt G Hdtmdp, there exists a policy tt' g Hctmdp such that 



(35) 
and Theorem 14. II we 



E" 



y^ I{xn G 5, d„ e A} 

.n=0 



W{-f,TT',r'). 



It thus follows from this and (|35[) that 

sup E 



Tren 



DTMDP 



^ I{in e S,a^e A} 

.n=0 



< 00. 
In other words, the DTMDP with the initial distribution $\gamma$ on $S$ is absorbing with the absorbing set $\{x_\infty\}$ in the sense of [12]. Now on the one hand, for any policy $\pi' \in \Pi_{CTMDP}$ there exists a policy $\pi \in \Pi_{DTMDP}$ such that



J^M:;'jdx,da) = E; 



n=0 



y I{xn S dx, a„ G da} 

.n=0 



by Lemmas 4.2 and 4.3 and Theorem 4.1, and for this policy $\pi \in \Pi_{DTMDP}$ there exists a randomized stationary policy, say $\pi^* \in \Pi_{DTMDP}$, satisfying



E" 



y I{xn G dx, a„ G da} 



.?i=0 



£; 



^ /{^n S ^2:, ftn S c?a} 



.n=0 



by the fact that the DTMDP is absorbing (just shown) and Lemma 4.2 of [12]. Thus, for any policy $\pi' \in \Pi_{CTMDP}$, there exists a randomized stationary policy $\pi^* \in \Pi_{DTMDP}$ such that



M^(7,7r',r,) = i?! 



^^a(i„) + gi„(a„) 



E 



(36) 



for each i = 0, 1, . . . , A'^. On the other hand, as in the proof of part (a) of this statement, for the randomized 
stationary pohcy tt* G ^dtmdp, one can construct a randomized stationary pohcy </>* G Hctmdp such that 



Wi^,r.n) = E. 



E 



a{xn) + qs:,^{dn) 



(37) 



for each $i = 0,1,\dots,N$. Combining (36) and (37) yields

\[
W(\gamma, \pi', r_i) = W(\gamma, \phi^*, r_i)
\]

for each $i = 0,1,\dots,N$. As $\pi' \in \Pi_{CTMDP}$ is arbitrarily fixed and $\phi^* \in \Pi_{CTMDP}$ is randomized stationary, this means that for problem (5) it suffices to restrict attention to the class of randomized stationary policies.

Next we consider again the auxiliary CTMDP with the state space $S$, action space $A$, admissible action spaces $A(x)$ ($x \in S$), and transition rates $q(dy \mid x,a)/\alpha(x)$ (recalling that $\alpha(x) > 0$ under Condition 3.2(f)), with the optimization problem



m 



s.t. E; 



ro{Ct,a) 
«(6) 



TT{da\^t)dt 



mm 



(38) 



e -* / !l%^^(da|^,)df 



a(6 



>rf„ j = l,2,...,iV, 



where we recall that EZ denotes the expectation for this CTMDP under the policy tt and initial distribution 7, 
and ^cTMDP ^till denotes the class of randomized stationary policies for the CTMDP. 

Based on a relation like the first line in p2p . one can show that under each randomized stationary policy 



G H'S 



Ei 



«(6) 



4>*{da\^t)dt 



Wi^,r,n) 



for each i = 0,1,..., A^. Indeed, this follows from the facts that E_^ 
W{x,(l)* ,r^ ) are both the minimal nonnegative solution to the equation 



r^-7^^i^'^*('^«i6)di 



and 



V{a 



J^rf{x,a)(t)*{da\x) ^ f^ J^q{dy\ {x}\x,a)(t)*{da\x) 

J^{a{a) +q^{a))(j>*{da\x) 



J^{a{x) + qa:{a))(j)*{da\x) 



by part (c) of Theorem 13.11 and the facts of 

Ei 



e * / i%^r(rf«i6)rft 



«(6) 



Ef 



e-* / 'M^<f>*ida\^t)dt 



a(6) 



"f{dx) 
and thus

Js 

Therefore, problem (38) is equivalent to problem (5). Finally, it remains to refer to Theorems 3.2 and 3.11 of [27] for the existence of a randomized stationary constrained-optimal policy for problem (38), which is also constrained-optimal for (5) in accordance with the previous discussions. The proof of this statement is thus completed. □

5.3 Proof of Theorem 3.4 without assuming Condition 3.1

Throughout this proof, by an SMDP, we always mean the one defined in Lemma 4.2 using the primitives of the CTMDP. In this subsection, Condition 3.1 is not required at all.

We now introduce some further notation, assuming that Conditions 3.2(b) and 3.3(a-d) are satisfied. Recall that

\[
S_1 := \Big\{ x \in S : \alpha(x) + \inf_{a \in A} q_x(a) = 0,\ \inf_{a \in A} \Big( q_x(a) - \sum_{i=0}^{N} r_i(x,a) \Big) > 0 \Big\},
\]
\[
B(x) := \{ a \in A : q_x(a) = 0 \},
\]

and define

\[
\hat{S}_1 := \Big\{ x \in S_1 : \sup_{a \in A} q_x(a) = 0 \Big\},
\]
\[
S_2 := \Big\{ x \in S : \alpha(x) + \inf_{a \in A} \Big( q_x(a) - \sum_{i=0}^{N} r_i(x,a) \Big) = 0 \Big\},
\]
\[
S_3 := \Big\{ x \in S : \alpha(x) + \inf_{a \in A} q_x(a) > 0 \Big\},
\]

which are all measurable (by Proposition D.5 in [10]). Clearly, $\alpha(x) = 0$ for each $x \in S_1 \cup S_2$. Moreover, $S_k$ ($k = 1,2,3$) form a disjoint measurable decomposition of $S$. The reason for considering this decomposition is explained in the following remark, which helps to understand the proofs below.

Remark 5.2. Suppose the conditions of Theorem 3.4(b) are satisfied, except that for now $S_1 \setminus \hat S_1$ might be nonempty. We say that a state $x$ is possibly absorbing if there exists an action $a \in A(x)$ such that $q_x(a) = 0$; in this case, the action $a$ is called $x$-absorbing. If, for a possibly absorbing state $x$, all the actions are $x$-absorbing, then the state $x$ is called definitely absorbing. Moreover, we remind the reader that the goal here is to maximize the expected total rewards discounted at a state-dependent and possibly zero-valued rate, and that $r_i(x,a) \le 0$ for all $(x,a) \in K$.

(a) The set $S_1$ is the collection of possibly absorbing states at which the discount factor is zero and every absorbing action leads to a strictly negative reward rate $r_i(x,a)$ for at least one $i = 0,1,\dots,N$.

(b) The set $S_2$ is the collection of possibly absorbing states at which the discount factor is zero but there exists at least one absorbing action that gives zero reward rates $r_i(x,a)$ for all $i = 0,1,\dots,N$. Thus, a policy that selects such favourable $x$-absorbing actions whenever the process is at $x \in S_2$ outperforms those that do not, given that they coincide in all other situations, since the process is then absorbed at that state with zero future rewards, which is the best possible situation.

(c) The rest of the states are collected in the set $S_3$.

(d) The subset $\hat S_1 \subseteq S_1$ is the collection of definitely absorbing states, where all the (absorbing) actions lead to $r_i(x,a) < 0$ for at least one of $i = 0,1,\dots,N$. It does not matter what action(s) a policy selects when the process visits a state $x \in \hat S_1$, since such visits lead to situations that are excluded by the conditions of Theorem 3.4(b), see Lemma 5.1 below together with the discussion immediately above it. Here, the finiteness of $A$ is essential.
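The decomposition discussed in Remark 5.2 can be checked mechanically when both $S$ and $A$ are finite. The following is only a minimal sketch under that assumption: the function name `classify_states`, the dictionary layout and the tolerance `tol` are illustrative choices, not part of the paper. It reads the defining conditions as $\alpha(x) + \inf_a q_x(a) = 0$ together with the sign of $\inf_a (q_x(a) - \sum_i r_i(x,a))$, and it is run on the data of Example 6.4 (with $q_2(a) = 0$ taken for both actions).

```python
def classify_states(states, actions, alpha, q, r, tol=1e-12):
    """Label each state of a finite CTMDP as S1_hat, S1_minus_S1_hat, S2 or S3.

    alpha[x]   : discount rate at state x (>= 0)
    q[(x, a)]  : total jump rate q_x(a) (>= 0)
    r[(x, a)]  : list of reward rates (r_0, ..., r_N), all <= 0
    """
    labels = {}
    for x in states:
        rates = [q[(x, a)] for a in actions]
        if alpha[x] + min(rates) > tol:
            labels[x] = "S3"            # discounted, or no absorbing action
            continue
        # From here on alpha(x) = 0 and some action is x-absorbing.
        gap = min(q[(x, a)] - sum(r[(x, a)]) for a in actions)
        if gap <= tol:
            labels[x] = "S2"            # a free absorbing action exists
        elif max(rates) <= tol:
            labels[x] = "S1_hat"        # definitely absorbing, costly
        else:
            labels[x] = "S1_minus_S1_hat"
    return labels


# Data of Example 6.4 (illustrative encoding).
states, actions = [1, 2], [1, 2]
alpha = {1: 0.0, 2: 1.0}
q = {(1, 1): 0.0, (1, 2): 1.0, (2, 1): 0.0, (2, 2): 0.0}
r = {(1, 1): [-0.0001], (1, 2): [-1.0], (2, 1): [-1.0], (2, 2): [-1.0]}

labels = classify_states(states, actions, alpha, q, r)
print(labels)
```

For this data, state 1 is possibly but not definitely absorbing, and state 2 is discounted, so the sketch places them in $S_1 \setminus \hat S_1$ and $S_3$ respectively.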
Under the conditions of Theorem 3.4(b), except that for now $S_1 \setminus \hat S_1$ might be nonempty, it is intuitive to expect that under any feasible policy with finite value for problem ([5]), the controlled process visits $\hat S_1$ with probability zero. We formally prove this fact in the next lemma, for which we introduce some further notation: for each fixed $n = 0,1,\dots$ and policy $\pi = (\pi_n) \in \Pi_{CTMDP}$, we define a measure on $S \times A$ by

$$\mathcal{M}^{\pi}_{\gamma,n}(dx,da) := E^{\pi}_{\gamma}\left[\int_{t_n}^{t_{n+1}} e^{-\int_0^t \alpha(\xi_s)ds}\, I\{x_n \in dx\}\,\pi_n(da|x_0,\theta_1,\dots,x_n,t-t_n)\,dt\right]. \qquad (39)$$



Lemma 5.1 Suppose Conditions 3.2(b) and 3.3(a-d) are satisfied, and consider problem (^. Furthermore, we assume $r_i(x,a) \le 0$ for each $i = 0,1,\dots,N$. Then, for each feasible policy $\pi \in \Pi_{CTMDP}$ satisfying $W(\gamma,\pi,r_0) > -\infty$, it holds that

$$\int_{(S\setminus\hat S_1)\times A} q(\hat S_1 \setminus \{x\}|x,a)\,\mathcal{M}^{\pi}_{\gamma,n}(dx,da) = 0,\qquad \mathcal{M}^{\pi}_{\gamma,n}(\hat S_1 \times A) = 0 \quad \text{for each } n = 0,1,\dots,$$

and

$$\gamma(\hat S_1) = 0.$$
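Proof. For each $n \ge 0$, by (39) and Remark 2.1 we see

$$\begin{aligned}
\mathcal{M}^{\pi}_{\gamma,n}(dx,da)
&= E^{\pi}_{\gamma}\Big[E^{\pi}_{\gamma}\Big[\int_{t_n}^{t_{n+1}} I\{x_n \in dx\}\, e^{-\sum_{i=0}^{n-1}\alpha(x_i)\theta_{i+1} - \alpha(x_n)(t-t_n)}\,\pi_n(da|x_0,\theta_1,\dots,x_n,t-t_n)\,dt \,\Big|\, x_0,\theta_1,\dots,x_{n-1},\theta_n,x_n\Big]\Big] \\
&= E^{\pi}_{\gamma}\Big[e^{-\sum_{i=0}^{n-1}\alpha(x_i)\theta_{i+1}}\, E^{\pi}_{\gamma}\Big[\int_0^\infty I\{\theta_{n+1} > t\}\, I\{x_n \in dx\}\, e^{-\alpha(x_n)t}\,\pi_n(da|x_0,\theta_1,\dots,x_n,t)\,dt \,\Big|\, x_0,\theta_1,\dots,x_{n-1},\theta_n,x_n\Big]\Big] \\
&= E^{\pi}_{\gamma}\Big[e^{-\sum_{i=0}^{n-1}\alpha(x_i)\theta_{i+1}} \int_0^\infty e^{-\alpha(x_n)t - \int_0^t\int_A q_{x_n}(a)\pi_n(da|x_0,\theta_1,\dots,x_n,s)ds}\, I\{x_n \in dx\}\,\pi_n(da|x_0,\theta_1,\dots,x_n,t)\,dt\Big],
\end{aligned}$$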

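which, together with the following fact from Remark 2.1 that

$$P^{\pi}_{\gamma}(d\theta_{n+1},dx_{n+1}|x_0,\theta_1,\dots,x_{n-1},\theta_n,x_n) = e^{-\int_0^{\theta_{n+1}}\int_A q_{x_n}(a)\pi_n(da|x_0,\theta_1,\dots,x_n,s)ds}\int_A q(dx_{n+1}\setminus\{x_n\}|x_n,a)\,\pi_n(da|x_0,\theta_1,\dots,x_n,\theta_{n+1})\,d\theta_{n+1},$$

leads to that for each $\Gamma_S \in \mathcal{B}(\hat S_1)$,

$$\begin{aligned}
\mathcal{M}^{\pi}_{\gamma,n+1}(\Gamma_S,da)
&= E^{\pi}_{\gamma}\Big[e^{-\sum_{i=0}^{n-1}\alpha(x_i)\theta_{i+1} - \alpha(x_n)\theta_{n+1}} \int_0^\infty e^{-\alpha(x_{n+1})t - \int_0^t\int_A q_{x_{n+1}}(a)\pi_{n+1}(da|x_0,\theta_1,\dots,x_{n+1},s)ds}\, I\{x_{n+1}\in\Gamma_S\}\,\pi_{n+1}(da|x_0,\theta_1,\dots,x_{n+1},t)\,dt\Big] \\
&= E^{\pi}_{\gamma}\Big[e^{-\sum_{i=0}^{n-1}\alpha(x_i)\theta_{i+1} - \alpha(x_n)\theta_{n+1}} \int_0^\infty I\{x_{n+1}\in\Gamma_S\}\,\pi_{n+1}(da|x_0,\theta_1,\dots,x_{n+1},t)\,dt\Big],
\end{aligned}$$

where the second equality is due to the fact that for each $x \in \hat S_1$, $q_x(a) = 0$ for each $a \in A$ and $\alpha(x) = 0$. Thus,

$$\begin{aligned}
\mathcal{M}^{\pi}_{\gamma,n+1}(\Gamma_S,da)
&= E^{\pi}_{\gamma}\Big[e^{-\sum_{i=0}^{n-1}\alpha(x_i)\theta_{i+1}}\, e^{-\alpha(x_n)\theta_{n+1}} \int_0^\infty I\{x_{n+1}\in\Gamma_S\}\,\pi_{n+1}(da|x_0,\theta_1,\dots,x_{n+1},t)\,dt\Big] \\
&= E^{\pi}_{\gamma}\Big[e^{-\sum_{i=0}^{n-1}\alpha(x_i)\theta_{i+1}} \int_0^\infty e^{-\alpha(x_n)u - \int_0^u\int_A q_{x_n}(a')\pi_n(da'|x_0,\theta_1,\dots,x_n,s)ds} \int_{\Gamma_S}\Big(\int_A q(dy\setminus\{x_n\}|x_n,a')\,\pi_n(da'|x_0,\theta_1,\dots,x_n,u)\Big) \int_0^\infty \pi_{n+1}(da|x_0,\theta_1,\dots,y,t)\,dt\,du\Big] \\
&= \int_{S\times A}\int_{\Gamma_S}\int_0^\infty q(dy\setminus\{x_n\}|x_n,a')\,\pi_{n+1}(da|x_0,\theta_1,\dots,y,t)\,dt\,\mathcal{M}^{\pi}_{\gamma,n}(dx_n,da'). \qquad (40)
\end{aligned}$$

Since $r_i(x,a) \le 0$ for all $(x,a) \in K$ and $q_x(a) = 0$ for $x \in \hat S_1$, from (40) we obtain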






< 



< 



< 



N 



™^^(Y1 ^iiV' '^)}'l{dy \ {x}\x, a')M^„{dx, da) 



SxAJSi °e^ j^o 



di 



TV 



(S\Si)y.AJSi -^eA — 



,Af 



niax{y^ri(y,a)}q((ij/\ {x}\x,a')MJ' j^{dx,da') 



dt. (41) 



^Af 



On the other hand, since $-\infty < W(\gamma,\pi,\sum_{i=0}^{N} r_i)$ and $\max_{a\in A}\{\sum_{i=0}^{N} r_i(x,a)\} < 0$ for each $x \in \hat S_1$ by the definition of $\hat S_1$ and the compactness of the set $A$, from (41) we deduce



N 



"^^?{5]1 ^»(2^' °'))li'^y \ {a;}k, a')M!'„(dx, da') 

i=0 

^^^i^^^'^^'")} / q{dy\{x}\x,a')]\/n {dx,da') 



{S\Si)xAJSi '^e^ ,^0 

Af 



0, 



^N 



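which, together with the fact that $\max_{a\in A}\{\sum_{i=0}^{N} r_i(x,a)\} < 0$ for each $x \in \hat S_1$ (just explained), in turn implies

$$\int_{(S\setminus\hat S_1)\times A} q(\hat S_1\setminus\{x\}|x,a')\,\mathcal{M}^{\pi}_{\gamma,n}(dx,da') = 0. \qquad (42)$$

Moreover, from (40) and (42) we have

$$\begin{aligned}
\mathcal{M}^{\pi}_{\gamma,n+1}(\hat S_1 \times A)
&= \int_0^\infty\int_{S\times A} q(\hat S_1\setminus\{x_n\}|x_n,a')\,\mathcal{M}^{\pi}_{\gamma,n}(dx_n,da')\,dt \\
&= \int_0^\infty\int_{(S\setminus\hat S_1)\times A} q(\hat S_1\setminus\{x_n\}|x_n,a')\,\mathcal{M}^{\pi}_{\gamma,n}(dx_n,da')\,dt = 0, \qquad (43)
\end{aligned}$$

where the second to last equality is by the fact that $q_x(a) = 0$ for each $a \in A$ whenever $x \in \hat S_1$. Thus, we have proved that $\mathcal{M}^{\pi}_{\gamma,n}(\hat S_1 \times A) = 0$ for each $n = 1,2,\dots$.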
Now consider the case of $n = 0$. Similarly to the previous calculations, since $r_i(x,a) \le 0$ for all $(x,a) \in K$ and $i = 0,1,\dots,N$, from (39) we have



N 



SxA ■ 



e; 



e; 



y^ ri[x, a)MJ^Q{dx, da) 



-a{xo)t 



N 



y^ ri{xo, a)7To{da\xo,t)dt 



I I{Oi > t}e-"("°)* / V r,ixo,a)7roida\xo, t)dt 

Jo J A ,^0 

/ / P;j0i > t}e-"(^°)* / Vr,(xo,a)7ro(da|xo,i)7(rfa;o)di 

Jo Js J A ,_n 



JS 



^- /o 9:^0 (a)iro(rfa|2;o,s)ds -Q(2;o)t 



N 



y ri{x, a)Tro{da\xo,t)j{dxo)dt 



i=0 



Jo Js 



^ fa Sa 1=^0 {a)-!roida\xo,s)ds -a{xo)t 



N 



y ri{x, a)no{da\xo , t)'-f {dxo)dt 



< 



Si Jo 



N 



max{> ri(x,a)}dt'-f(dxo) < 0, 



aeA 



(44) 



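Since $\int_{S\times A}\sum_{i=0}^{N} r_i(x,a)\,\mathcal{M}^{\pi}_{\gamma,0}(dx,da) \ge W(\gamma,\pi,\sum_{i=0}^{N} r_i) > -\infty$, and $\max_{a\in A}\{\sum_{i=0}^{N} r_i(x,a)\} < 0$ for each $x \in \hat S_1$, it follows from (44) that $\gamma(\hat S_1) = 0$, and so

$$\mathcal{M}^{\pi}_{\gamma,0}(\hat S_1 \times A) = E^{\pi}_{\gamma}\left[\int_0^{\theta_1} e^{-\alpha(x_0)t}\, I\{x_0 \in \hat S_1\}\,dt\right] = \int_{\hat S_1}\int_0^\infty e^{-\int_0^t\int_A q_{x_0}(a)\pi_0(da|x_0,s)ds}\, e^{-\alpha(x_0)t}\,dt\,\gamma(dx_0) = 0. \qquad (45)$$

Thus, combining (42)-(45) completes the proof. □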

Suppose the conditions of Theorem 3.4(b) are satisfied. We observe that, according to Remark 5.2 and Lemma 5.1, for problem ([S]) it suffices to be restricted to policies $\pi = (\pi_n) \in \Pi_{CTMDP}$ that satisfy properties (i)-(iii) below:

(i) for each $n = 0,1,\dots$, when $x_n \in S_1 \setminus \hat S_1$, then

$$\int_0^\infty\int_{A\setminus B(x_n)} q_{x_n}(a)\,\pi_n(da|x_0,\theta_1,\dots,x_n,s)\,ds = \infty,$$

with $B(x)$ as defined by ([5|), which, to be more rigorous, means

$$P^{\pi}_{\gamma}(x_n \in S_1\setminus\hat S_1) = E^{\pi}_{\gamma}\left[I\{x_n \in S_1\setminus\hat S_1\}\, I\left\{\int_0^\infty\int_{A\setminus B(x_n)} q_{x_n}(a)\,\pi_n(da|x_0,\theta_1,\dots,x_n,s)\,ds = \infty\right\}\right]; \qquad (46)$$

(ii) for each $n = 0,1,\dots$, when $x_n \in \hat S_1$, then

$$\pi_n(da|x_0,\theta_1,\dots,x_n,t-t_n) = \delta_{a^*}(da) \qquad (47)$$

with $a^* \in A$ being a fixed point;

(iii) for each $n = 0,1,\dots$, when $x_n \in S_2$, then

$$\pi_n(da|x_0,\theta_1,\dots,x_n,t-t_n) = \delta_{f^*(x_n)}(da) \qquad (48)$$

with $f^*$ being a measurable mapping from $S_2$ to $A$ such that

$$\alpha(x) + \inf_{a\in A}\Big(q_x(a) - \sum_{i=0}^{N} r_i(x,a)\Big) = 0 = \alpha(x) + q_x(f^*(x)) - \sum_{i=0}^{N} r_i(x,f^*(x)) \qquad (49)$$

for each $x \in S_2$, whose existence is guaranteed by Proposition D.5 in [20]. (Hence, $\alpha(x) = q_x(f^*(x)) = r_i(x,f^*(x)) = 0$ for all $x \in S_2$ and $i = 0,1,\dots,N$.)

We formalize these observations in the following lemma for future reference.

Lemma 5.2 Suppose the conditions of Lemma 5.1 are satisfied. Then for problem (^, out of the class of feasible policies with finite values, it suffices to be restricted to those satisfying (46), (47) and (48), i.e., those exhibiting the aforementioned properties (i,ii,iii).

Proof. Consider a feasible policy $\pi$ with a finite value, and suppose that there exists some $n$ such that $P^{\pi}_{\gamma}(\xi_{t_n} \in S_1\setminus\hat S_1) = P^{\pi}_{\gamma}(x_n \in S_1\setminus\hat S_1) > 0$. Similarly to the proof of Lemma 5.1, we see



e; 



< e:; 



e:; 



e:; 



,1+1 



JV 



e Jo"(?=)''« / y2n{xn,a)TTn{da\xo,. ..,Xn,t)dt 

t„ JiA\B{x„))\JB{xr,} ,^0 

e-ELo "(-.)/ I{xneSi\Si} max {yr,(x„,a)}ds 

Jt„ aeB{x„) ^ 

JV 

e-i:?=o^i-')l{j:,^eS,\Si} max { V r,(x„, a)K+i 



JV „oo 

ErJo^ "(-.)/{a.,^ 6 5i \ 5i} max {V r,;(x„, a)} / e 



'/o/. 



A\B{j:n) 



q^^ {a)7Tji{da\xQ,6i ,...jXn,s)ds 



dt 



where for the second line we recall that $\alpha(x) = 0$ for each $x \in S_1\setminus\hat S_1$ by the definition of the set $S_1$. If the property (i) does not hold, then



P^(x„ G 5*1 \ 5*1, / / gx„(a)7r„(da|xo,6'i,. . . ,a;„,s)ds < 00) > 0, 

Jo Ja\b{x„) 



and thus 



/•oo ^ 



which implies 



e; 



/{a-„G5i\5i}e-^"=o'"("') max {Yr,{xn,a)} g- /o 7^\b(.„) ?==„ ^-"(d'^ko A,....x„,«)d«^^ 

aeBix,,) ^ Jo 



(recalling that $\sum_{i=0}^{N} r_i(x,a) < 0$ for each $x \in S_1\setminus\hat S_1$ and $a \in B(x)$), and this contradicts the assumption that $\pi$ is feasible with a finite value. Thus (i) is verified for $\pi$. That one can be restricted to feasible policies with finite values and exhibiting the properties (ii,iii) is due to Lemma 5.1 and the definition of the sets $\hat S_1$ and $S_2$. □

Next we derive some results similar to those in Theorem 4.2 but without assuming Condition 3.1 to be satisfied.

Lemma 5.3 Suppose Conditions 3.2(b) and 3.3(a-d) are satisfied, and consider problem (0j. Furthermore, we assume $r_i(x,a) \le 0$ for each $i = 0,1,\dots,N$. Then for each feasible policy $\pi \in \Pi_{CTMDP}$ satisfying $W(\gamma,\pi,r_0) > -\infty$ and relations (46), (47) and (48), there is a Markov policy $\sigma = (\sigma_n)_{n=0,1,\dots} \in \Pi_{SMDP}$ such that, for each $n = 0,1,\dots$,
(a) Aq„(r5 X Ta) = M^"„(rs X Ta) for each Ts £ B{S \ S2), Ta e B{A), and M^^^Si) = AI^JSi) = 0;

(b) $\sigma_n(da|x) = \delta_{a^*}(da)$ for all $x \in \hat S_1$, with $a^*$ as in (47); and

(c) $\sigma_n(da|x) = \delta_{f^*(x)}(da)$ for each $x \in S_2$, with $f^*$ as in (48).

Proof. We arbitrarily fix a policy tt € Hcta/dp as in the statement of this lemma. For each Ts G B{S) and 
n = 0, 1, . . . , a calculation similar to that of (fTB)) gives 



M^jTs X ^) 



EZ 



e-^"=o'^i^')f>'+iI{x,^(^Ts} 



X £; 



e-"(^-)t( / q^^{a)Tr{da\uj,tn + t) + a{xn)]dt 



Xc&i,. 



£;; 



e 



-1:7=0 '-(-.W.^i I {x^eTs} I e 

Jo 

p;ien+i>t\xo,ei,...,xn)dt] 



"^""^* ( / q.AaMda\oj,tn+t) + aixn) 



^'-» "(^')^'+^/{x„ G Ts} p e-"(^")* ("y g,„ {a)^{da\u:, t^ + t) + a{xn)] 

X e fo Ia 9="n (aXrfa|w,t„+s)ds^^ 



£;; 



e: 



7 



=- Er=^ 



."=0^"(2;i)Si + i/|2;^ g p j 1^2 — g-/„°° /4(9x„(a)+a(a:„))7r„(rfa|2;o,ei,...,2;„,s)ds 



< 00. 



Then, we can apply Proposition D.8 in |20) for that there exists a unique stochastic kernel ct„ from S to A such 
that (UHl) holds, that is 



Ml^{dx,da) = ml^{dx)an{da\x). 



(50) 



Since a{x) = qx{a) = for all x & Si and a £ A, and a{x) = fe(/*(a;)) = for all x & S2 (by (|i5|) and 
(Uni), it follows from (^5)) and ([SO)) with Fs being respectively replaced by ^i and £'2 that 



mZ,,iSi)=0 = mZJS2) Vn = 0,l, 



(51) 



which, together with (fT8|) . allows us to modify the definition of (Tn{da\x) (if needed) without loss of generality, 
such that 



an{da\x) = Sa*{da) for a: G Si, and (T„((ia|x) = ^^.(^.^((ia) for x G ^2 



(52) 



with a* and /* as in (|T7|) and (P^ . respectively. Thus, parts (b) and (c) of Lemma [5751 follow. 

Next we show that this policy $\sigma = (\sigma_n)_{n=0,1,\dots} \in \Pi_{SMDP}$ is the required policy as follows.

Since parts (b) and (c) of Lemma [?751 follow from ([5^ and the second part of Lemma 15751 a) is from (j5ip as 
soon as the first part of Lemma IS.Bf a*) is true, the rest of this proof verifies the first part of Lemma l5.3r a) by 
induction. 

Consider the case of n = 0. Obviously, we see from ([iS]). ([50)1 and ([5T|) that 



™;o(rs) = 7(r5) = ™" o(rs) v Fs g b(5 \ S2). 

Since for each n == 0, 1, . . . , Fs G ^(5* \ S2) and F^ G B{A), it follows from 1^ and ([111) that 



M;„(Fs X Fa) = / m!;„(dx)CT„(FA|x) = / m^ „(dx)a„(FA|a;) = M^"„(Fs x F^) 
Jrs ' JTs 

as soon as m!^ „(dx) = m^ „((ix) on B{S \ S2), we see from ([55| and ([5^ that 

M;,o(rs X Fa) = Af;,o(rs X Ta) V Fg g B(5 \ ^2) and F^ G B{A). 



(53) 



(54) 
Suppose that the following holds for some k > 0,

M;.fc(r5 X r^) = m^JTs x r^) v Ts e b{s \ S2) and r^ e b{A). 

Consider the case of n = fc + 1 as follows. For each Ts G B{S \ S2), it follows from ([51]) and the definition of 
((T„) satisfying (j52p that 



q{^s\{y}\y,a) 
S2XA a{y) + qyia) 



a,{da\y)m;^,{dy) = = / ^(IfAM^Af,-,,(dy, da), 



S2X 



which, together with the arguments in (1211) and (jlip . gives 



<fc+i(rs) = / 



9(rs\{2/}|y,a) ,>, 






SxA a(y) + 9!;(a) 



(5\S2)xA a(2/) + gy(a) 



-M^Jdy,da) 



A o:{y) + qy{a) 



(bydnj) 
TT"^ r^—Mk[dy,da) 

S2XA a{y) + qyW 



'^^^^\{y}\y,a) 

{s\S2)xA a(y) + qyia) 

'-^^}^^Ml,idy,da), 
SxA a[y) + qy{a) 



q{^s\{y]\y.a) 

.xA Oi{y)+qy{a) 



cjk{da\y)mZ^j,{dy) 



= El 



e; 



Utk 



I^XiMs / q{Ts\{^tm,a)7T{da\u,t)dt 



-i:i 



to a(^,)e,+i Ia qj^s \ {xk}\xk,a)TT{da\u}, tk+i) 



J^qx,{a)T^ida\^^tk+i) 



(by dm) 

(as in dni)) 



(55) 
(56) 



Moreover, since q{Si \ {x}\x, a) = for each x ^ Si, from ([M|) and (|55|) with F^ := ^i we have 
m^.k+iiSi) = 

g(S'i \ {x}\x, a)M^{dx, da) = 0, 



' q{Si\{x}\x,a)M^{dx,da)+ / g(S'i \ {a;}|a;, a)Af^(dx, da) 

(S\5i)xA JsixA 



l{S\Si)xA 

where the last equality follows from Lemma [STTl Thus, by ([5T|) and (|57p . we have 

rh'^,k+iiSi) = = m^ fe+i(S'i). 
Furthermore, from (j46p and the definitions of 6*1, ^3 we have 



(57) 



(58) 



a{xk+i) + / qx^+iia)nn{da\xo 



, ^1, . . . , 



a;fc+i,s) 



ds = 00 a.s. 



for each Xfe+i S Fg S B{S\ (<5'2 U '^'i)), which implies that the calculations in (J16p and (P^ still remain valid 
with S being replaced by F5 e B{S \ {S2 U S"!)), and so (using ([56))') we have 



niZ,^,{Ts)=EZ 



J^qxAa)Trida\^,tk+i) 



m^,k+i{^s), 



for F5 G B{S \ {S2 U S'l)). As a consequence of this and ([55]) . we have 

m;fc+i(F5) = == ™;fe+i(F5) V F5 e B{S \ S2), 
which, together with ([M|). leads to the statement by induction. 



D 



Proposition 5.3 Suppose Conditions 3.2(b) and 3.3(a-d) are satisfied, and consider problem 0). Furthermore, we assume $r_i(x,a) \le 0$ for each $i = 0,1,\dots,N$, and $S_1 \setminus \hat S_1 = \emptyset$. Then for each feasible policy $\pi \in \Pi_{CTMDP}$ satisfying $W(\gamma,\pi,r_0) > -\infty$ and relations (47) and (48), there is a Markov policy $\sigma = (\sigma_n)_{n=0,1,\dots} \in \Pi_{SMDP}$



such that W{-f,Tr,r,)^E^ ^^=0 e" "- c.(:£„)+9^„(a„) 



^-I2i = "(^i)^ 



ri(xn ,a„) 



for each i = 0,1, . . . ,N. 
Proof. Let $\sigma = (\sigma_n) \in \Pi_{SMDP}$ be the Markov policy from Lemma 5.3. By ([59)) and (H]) we have

oo „ 

W{^,Tr,n) = V / r,(x,a)M^„(da;,da) 

= E(/, r,(x,a)M;„(dx,da) 

+ / ri{x,a)A.q^{dx,da)+ I ri(a;, a)M^„(da;, da) i , (59) 

JS2XA ' JsaxA ' J 

since Si\Si^ 0. By Lemma [Ql wc have, M^_„(S'i) = 7\7^_„(S'i) = 0. Also it follows from Lemma [Ql ((39|) and 
El) that 



M^^nidx, da) = M^.^{dx, da) = {q.^{a) + a(x))M^_„(dx, da) 
on 8(8 \ S2) X S(^), for all n > 0. Thus, by ^ we have 



M^(7,7r,r,) = V|/ r,(i, a)M;^„(dx, da) 



r, (x, a)M;,„ (dx, da) + / — ^{%^^M".„ (dx, da) }> , 
S2XA Js^xAO^l^J+Qxia) 



y(r nix, a) J ,^a ^^ ^^) 



S2X 



'\^T\ M nidx, da) + f /f!'°l , M;jdx, da) 



:^oJsxAa{x)+q^{a) ^ 



where the second equality follows from the facts (recalling the convention that § := and x 00 := 0) 
M^JSi xA)= Ml^{Si X yl) == 0; 

""^^^'"^ -M^Jdx,da)= f /;^^'°\ ^ a„ida\x)m%idx)=0- 



S2xAaix)+q^{a) ^'^ ' J S2XA (^(x) + q^^a) 



n{x,a)Nq^{dx,da) = E; 
S2XA 



I{Xn e S2}e-^o'-^^^^^'niXnJ*{x„))dt 



0, 



which hold because of P^ . Lemma IS^T c). and P5)) -([M 1) . Thus the statement is proved. □ 



The statement of the next lemma is obvious when Condition 3.1 is satisfied, which is, however, not required in Theorem 3.4. On the other hand, it validates the relevant results in [5] and [33] (see Proposition 5.1(b,c)) used in the proof of Theorem 3.4 below.



Lemma 5.4 Suppose Condition \3.SY a-d.f) is satisfied, and that ri{x,a) < for each i = 0,1, . . . , N. Then 

a(l]+g%) «s "Wer continuous m {x,a) £ K for each i = 0,1,2,...,N, and ^'['l^^f^lT^ ~ ^^S+tla) *« 
continuous in {x, a) G K for each bounded continuous function f on S. 

Proof. Recall again that the convention of [| := and ^ := +cxd if a; > (or ^ := — cx) if a: < 0) are 
adopted throughout this article. Let i = 0,1,. . .,N be arbitrarily fixed, and below we thus write ri(x,a) as 
r{x, a) for brevity. Since r{x, a) < is upper semicontinuous. Lemma 7.4 of [3] asserts that there exists a 
nonincreasing sequence of bounded continuous functions gn^x, a) on K such that lim„_s.oo gn{x, a) = r(x, a) for 
each (x, a) G K. Moreover, by inspecting the proof of Lemma 7.4 of U, it can be taken that gn{x, a) < for 
each n. Since a(x) + qx(a) > 0, we see that !'"^^f\ , 1 < and decreases to n^'") as n — !► cx) for each 
[x, a) d K (recall that if r(x, a) = 0, then gn{x, a) = for each n). Furthermore, it is easy to check that — cxd < 
'^inf(i:,a)eK{5"(a;,a)} < a(x)+qAa)+^ - ^ (I'ccalling that 5„ is bounded) and that c(^)'^^^^(°)^i is continuous 
in {x,a) G K for each n. Thus, by Lemma 7.4 of [3], n^.a) ^^ upper semicontinuous in (a;, a) G K. As for 
the continuity of J^^^lv ^uC ~ a(x)+ (a) ^'^ ('''' ^) ^ -^ with an arbitrarily fixed bounded continuous function 
/ on 5, one can show as in the above that /.(/(^)--^P^^sjl/g^l})^(rf.k.a) ^^^ /,(-/(y)-su^P,,,{|/yi})g(dy|x,a) ^^^ 

both upper semicontinuous in {x,a) G K. Since JgSup^i^g{\f{x)\}q{dy\x,a) = 0, this shows the continuity of 
f^f{y)q{dy\xa) ^^^ ^^^ statement follows. D 

a[x)+q^[a) ' 

Proof of Theorem 3.4. Lemma 5.3 and Proposition 5.3 established above assert that any reasonable policy for the CTMDP problem can be replicated by a policy for an SMDP. On the other hand, one can easily modify the reasoning in the proof of Lemma 4.3 to show that for any policy $\sigma \in \Pi_{SMDP}$, there exists a policy $\pi \in \Pi_{CTMDP}$ such that



M^„{dx,da) 

J- y<j.u.,u,Li.j — j^-- — - 

^' qx[a) + a[x) 



MZ^{dx,da) 



for each $n = 0,1,\dots$, where $\mathcal{M}^{\pi}_{\gamma,n}(dx,da)$ is defined by (39). In comparing with ([5^, this asserts that any policy for the SMDP can be replicated by a policy for the CTMDP. Keeping this in mind, the rest of the proof proceeds with exactly the same argument as in the relevant proofs of Theorems 3.1, 3.2 and 3.3. Note that the relevant results of [5] and [33] (see Proposition 5.1(b,c))* are still applicable due to Lemma 5.4 and the discussion immediately above it. Since the involved modifications are rather obvious, we omit the details to avoid repetition. □

6 Examples 

In this section, we consider specific examples illustrating the technical roles played by our conditions in the proofs of the main statements, as well as the applications of the obtained results. Additionally, Example 6.3 further shows the insufficiency of the class of embedded Markov policies (and in particular stationary policies) for CTMDPs with nonpositive reward rates, see Remark 6.1.

The proofs of Theorems 3.1, 3.2 and 3.3 are all based on (25). When Condition 3.1 is violated, the following Example 6.1 shows that (25) could fail to hold.

Example 6.1 

Consider the CTMDP with $S = \{0,1\}$, $A = \{a^*\}$ (the dummy action), $q(\{1\}|0,a^*) = 1 = q_0(a^*)$, $q_1(a^*) = 0$, $N = 1$, $r_0(0,a^*) = 0$, $r_0(1,a^*) = 1$, $r_1(0,a^*) = r_1(1,a^*) = 0$, $\alpha(0) = \alpha(1) = 0$, and $\gamma(\{0\}) = 1$. Evidently, Condition 3.1 is violated (at the state 1). Let a policy $\pi \in \Pi_{CTMDP}$ be arbitrarily fixed. Then
under the standard convention that x oo := as followed in this paper, we see MI^ ^{S x A) = so 

that EZolK o^i'^i+qlU ^Ud^^da) =0^^ = W{^,n,ro) but EZo Ik ^71^1) ^^^^^ = = 
W{'j, 71", ri). In fact, one can also see that Lemmas 14. 21 and 14.31 both fail for this example. Indeed, for any policy 
n G ILcTMDP and n' G Hsmdp, M^,i({1} ^ {«*}) = ^^ ^"^({l} x {«*}) = 1> while, if a(0) = a(l) = 1, 
then Af^^({l} x {a*}) = ^ = il/^j({l} x {a*}), which is as desired. Indeed, the first equality follows from 
^7,i({l} X {a*}) = ^^•^.il'S' X A) = EJ^[e~''^], see ([HI), and the second equality is by the definition of the SMDP 
and the fact of M^'i({l} x {a*}) = EJ^' [e-^^], see ^. D 
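For the discounted variant $\alpha(0) = \alpha(1) = 1$ mentioned at the end of Example 6.1, both quantities reduce to $E[e^{-\theta_1}]$, where the first sojourn time $\theta_1$ is exponentially distributed with rate $q_0(a^*) = 1$. A minimal numerical sketch of this Laplace transform is given below; the function name `discounted_sojourn_factor`, the truncation `horizon` and the midpoint rule are illustrative choices, not taken from the paper.

```python
import math


def discounted_sojourn_factor(q, alpha, horizon=60.0, steps=200_000):
    """Approximate E[exp(-alpha * theta)] for theta ~ Exp(q) by the midpoint rule."""
    h = horizon / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        # integrand: discount factor times the exponential density of theta
        total += math.exp(-alpha * t) * q * math.exp(-q * t) * h
    return total


numeric = discounted_sojourn_factor(q=1.0, alpha=1.0)
closed_form = 1.0 / (1.0 + 1.0)   # q / (q + alpha)
print(numeric, closed_form)
```

The closed form $q/(q+\alpha)$ equals $1/2$ here, which is the common value of the two measures in the discounted variant.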

The last observation in the previous example raises the question of whether Condition 3.1 can be relaxed to the weaker condition that $q_x(a) + \alpha(x) > 0$ for each $x \in S$ and $a \in A(x)$. The next example shows that the answer to this question is negative in general.

Example 6.2 



*For part (a) of Theorem 3.4, one should now refer to Theorem 16.2 in [33] or Proposition 5.1(b) instead of Theorem 15.2 in [33] as in the proof of Theorem 3.2(a).
Consider the CTMDP with $S = \{1,2\}$, $A = A(1) = [0,\infty)$, $A(2) = \{0\}$ (i.e., state 2 is uncontrolled), $q_1(a) = q(\{2\}|1,a) = e^{-a}$, $q_2(0) = 0$. Let $\gamma(\{1\}) = 1$, $\alpha(1) = 0$ and $\alpha(2) = 1$. Let us fix a policy $\pi$ defined by $\pi(\{a\}|1,t) = \pi_0(\{a\}|1,t) = I\{a = t\}$. This is a correct definition of a policy since the CTMDP admits no more than one jump, and the state 2 is uncontrolled and absorbing. Clearly, Condition 3.1 is violated because $q_1(a) > 0$ is not separated from zero while $\alpha(1) = 0$. On the other hand, for each $x \in S$ and $a \in A(x)$, $\alpha(x) + q_x(a) > 0$. Under this given policy, $\int_A q_1(a)\pi(da|1,t) = e^{-t}$, and thus straightforward calculations give

$$\begin{aligned}
M^{\pi}_{\gamma,0}(\{1\}\times A)
&= E^{\pi}_{\gamma}\Big[\int_0^{\theta_1} I\{x_0 = 1\}\int_A q_{x_0}(a)\,\pi(da|x_0,t)\,dt\Big] \\
&= E^{\pi}_{\gamma}\Big[\int_0^\infty I\{x_0 = 1\}\int_A q_{x_0}(a)\,\pi(da|x_0,t)\,P^{\pi}_{\gamma}(\theta_1 > t)\,dt\Big] \\
&= E^{\pi}_{\gamma}\Big[I\{x_0 = 1\}\int_0^\infty\int_A q_{x_0}(a)\,\pi(da|x_0,t)\,e^{-\int_0^t\int_A q_{x_0}(a)\pi(da|x_0,s)ds}\,dt\Big] \\
&= E^{\pi}_{\gamma}\Big[I\{x_0 = 1\}\Big(1 - e^{-\int_0^\infty e^{-s}ds}\Big)\Big] = 1 - e^{-1} \ne 1 = M^{\sigma}_{\gamma,0}(\{1\}\times A),
\end{aligned}$$

for any policy $\sigma$ for the SMDP corresponding to this CTMDP. Thus, Lemma 4.2 and (16) do not hold for this example. □
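The value $1 - e^{-1}$ obtained in Example 6.2 can be checked numerically. The sketch below assumes, as in the example, that the jump intensity at time $t$ under the fixed policy is $e^{-t}$, so that $P(\theta_1 > t) = \exp(-(1 - e^{-t}))$; the function name `ctmdp_mass`, the truncation `horizon` and the midpoint rule are illustrative choices only.

```python
import math


def ctmdp_mass(horizon=60.0, steps=200_000):
    """Midpoint-rule value of int_0^horizon e^{-t} * P(theta_1 > t) dt."""
    h = horizon / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        hazard = math.exp(-t)                         # intensity under pi
        survival = math.exp(-(1.0 - math.exp(-t)))    # P(theta_1 > t)
        total += hazard * survival * h
    return total


mass = ctmdp_mass()
print(mass, 1.0 - math.exp(-1.0))
```

The computed mass stays strictly below 1, the value on the SMDP side, because the process never leaves state 1 with probability $e^{-1}$.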

By the way, we mention that the above CTMDP does not satisfy the conditions in [T^ . 

The idea of the proofs of Theorems 3.1, 3.2, 3.3 and 3.4 is based on the reduction of the CTMDP to an SMDP, which is in turn equivalent to a CTMDP, see Theorem 4.1 and Proposition 5.2. The next Example 6.3 further shows that if Condition 3.1 is violated while $A$ is not compact (thus the conditions of Theorem 3.4 are not violated), then there exists a policy for the CTMDP whose performance vector cannot be replicated by any policy for the CTMDP and SMDP, i.e., the natural extension of the reduction (or say transformation) method
proposed in [111 [30] and ^ and ^ fail to hold. 



Example 6.3 (Inapplicability of the transformation method when A is noncompact)

Consider the CTMDP as in Example 6.2 but with $\alpha(1) = \alpha(2) = 0$ and $A = A(1) = A(2) = [0,\infty)$. Let $N = 0$, $r_0(1,a) = -e^{-a}$ and $r_0(2,a) = 0$ for each $a \in A$. We fix the policy $\pi$ as defined in Example 6.2, so that $\int_A q_1(a)\pi(da|1,t) = -\int_A r_0(1,a)\pi(da|1,t) = e^{-t}$. Then with the initial distribution $\gamma(\{1\}) = 1$ and under this policy $\pi$, we see

$$W(\gamma,\pi,r_0) = E^{\pi}_{\gamma}\left[\int_0^\infty\int_A r_0(\xi_t,a)\,\pi(da|\omega,t)\,dt\right] = -E^{\pi}_{\gamma}\left[\int_0^{\theta_1} e^{-t}\,dt\right] > -\int_0^\infty e^{-t}\,dt = -1,$$

where the strict inequality is due to the fact that $P^{\pi}_{\gamma}(\theta_1 = \infty) = e^{-1} < 1$, see Example 6.2 above. By the way, if the action space $A$ is finite and thus compact, then $P^{\pi}_{\gamma}(\theta_1 < \infty) = 1$. On the other hand, since $\frac{r_0(1,a)}{\alpha(1)+q_1(a)} = -1$ for each $a \in A$, we have that for each policy $\varphi \in \Pi_{DTMDP}$

$$E^{\varphi}_{\gamma}\left[\sum_{n=0}^{\infty}\frac{r_0(x_n,a_n)}{\alpha(x_n)+q_{x_n}(a_n)}\right] \le -1.$$

Thus, $\sup_{\varphi\in\Pi_{DTMDP}} E^{\varphi}_{\gamma}\left[\sum_{n=0}^{\infty}\frac{r_0(x_n,a_n)}{\alpha(x_n)+q_{x_n}(a_n)}\right] < W(\gamma,\pi,r_0)$, and the transformation method is not applicable in this case. □

Example 6.3 incidentally reveals the following interesting and important observation.

Remark 6.1 (Insufficiency of the class of embedded Markov policies) In Example 6.3, for any stationary policy $\pi \in \Pi^{S}_{CTMDP}$, one can easily show that

$$W(\gamma,\pi,r_0) = -1 = E^{\pi}_{\gamma}\left[\sum_{n=0}^{\infty}\frac{r_0(x_n,a_n)}{\alpha(x_n)+q_{x_n}(a_n)}\right].$$

Indeed, the first equality follows from the fact that $P^{\pi}_{\gamma}(\theta_1 < \infty) = 1$ under any stationary policy, whereas the second equality is by the definition of the DTMDP. Furthermore, since this CTMDP admits no more than one jump, it also shows that any embedded Markov policy for the CTMDP can be replicated by a policy in the DTMDP. Thus, Example 6.3 also shows that the class of (randomized) embedded Markov policies defined in Section \^, which is larger than the class of stationary ones, is insufficient for CTMDPs with total undiscounted reward criteria. To the best of our knowledge, this observation is new for CTMDPs.
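The gap described in Example 6.3 and Remark 6.1 can be illustrated numerically. The sketch below assumes the data of Example 6.3 in the form $q_1(a) = -r_0(1,a) = e^{-a}$ and considers deterministic action schedules $t \mapsto a(t)$ used while the process stays in state 1; the function name `total_reward`, the truncation `horizon` and the midpoint rule are illustrative choices, not part of the paper.

```python
import math


def total_reward(action_at, horizon, steps=400_000):
    """Approximate W = E[ int_0^{theta_1} r_0(1, a(t)) dt ] for state-1 schedules.

    action_at(t) is the action used at time t while the process is in state 1;
    the jump intensity and the reward magnitude are both exp(-a(t)).
    """
    h = horizon / steps
    cum_hazard = 0.0
    value = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        rate = math.exp(-action_at(t))
        # survival probability of the sojourn at the midpoint of the step
        survival = math.exp(-(cum_hazard + 0.5 * rate * h))
        value -= rate * survival * h
        cum_hazard += rate * h
    return value


w_time_varying = total_reward(lambda t: t, horizon=60.0)    # pi({a}|1,t) = I{a = t}
w_stationary = total_reward(lambda t: 0.0, horizon=60.0)    # always action 0
print(w_time_varying, w_stationary)
```

Under these assumptions the time-varying schedule yields about $-(1 - e^{-1})$, strictly better than the value $-1$ of the stationary schedule, in line with the remark.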
Example 6.4 (Inapplicability of the transformation method when $S_1 \setminus \hat S_1 \ne \emptyset$)

Consider the CTMDP with $S = \{1,2\}$, $A = \{1,2\}$, $q_1(1) = 0 = q(\{2\}|1,1)$, $q_1(2) = q(\{2\}|1,2) = 1$, $q_2(a) = 0$, $r_0(1,1) = -0.0001$, $r_0(1,2) = -1$, $r_0(2,a) = -1$. Let $\alpha(1) = 0$, $\alpha(2) = 1$ and $\gamma(\{1\}) = 1$. Note that $S_1 \setminus \hat S_1 = \{1\}$.

Under any policy $\pi$ such that the action 1 is in use during the interval $[0,10)$, and the action 2 is in use after that (by the way, such a policy is called switching in [9]), it is easy to see that $-\infty < W(\gamma,\pi,r_0) \ne -2$. On the other hand, for the corresponding DTMDP, under any policy $\sigma$, the value $E^{\sigma}_{\gamma}\left[\sum_{n=0}^{\infty}\frac{r_0(x_n,a_n)}{\alpha(x_n)+q_{x_n}(a_n)}\right]$ is either $-2$ or $-\infty$, since $\frac{r_0(1,1)}{q_1(1)} = -\infty$, $\frac{r_0(1,2)}{q_1(2)} = -1$, and $\frac{r_0(2,a)}{\alpha(2)+q_2(a)} = -1$. Thus, we see that the transformation method fails to work in this case. □
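The value of the switching policy in Example 6.4 can be computed by hand; the following sketch spells out the three contributions. It assumes the reading of the example in which no discount accrues in state 1 (since $\alpha(1) = 0$) and state 2 is absorbing with discount rate $\alpha(2) = 1$. The function and parameter names are illustrative only.

```python
def switching_policy_value(switch_time, r_stay, r_go, q_go, r_abs, alpha_abs):
    """Expected total reward of 'stay until switch_time, then leave' in Example 6.4."""
    # Action 1 is absorbing and undiscounted: reward accrues linearly in time.
    stay = r_stay * switch_time
    # Action 2: exponential sojourn with rate q_go, so the expected length is 1/q_go.
    go = r_go / q_go
    # Absorbing state 2: int_0^inf exp(-alpha_abs * s) * r_abs ds.
    absorbed = r_abs / alpha_abs
    return stay + go + absorbed


w = switching_policy_value(
    switch_time=10.0, r_stay=-0.0001, r_go=-1.0, q_go=1.0, r_abs=-1.0, alpha_abs=1.0
)
print(w)
```

Under these assumptions the value is $-2.001$, which is finite and different from $-2$, whereas the DTMDP can only produce $-2$ or $-\infty$.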

Next, we formulate an example, which cannot be covered by the current literature on CTMDPs, whereas 
our optimality results are applicable. This thus illustrates the contributions of the present work. 

Example 6.5 (Explosive non-absorbing constrained CTMDPs with undiscounted rewards) 

Consider a controlled birth-and-death process with the state space $S = \{0,1,\dots\}$, the action space $A$ being an arbitrary nonempty finite space, and the admissible action spaces $A(x) = A$ for each $x \in S$. The transition rates are given by $q(\{x+1\}|x,a) = b_x(a)$, $q(\{x-1\}|x,a) = a_x(a)I\{x \ge 1\}$, and $q_x(a) = a_x(a)I\{x \ge 1\} + b_x(a)$ for each $x \in S$. We suppose that $a_x(a) > 0$ and $b_x(a) > 0$ are continuous in $(x,a) \in S \times A$. Let $N = 1$, and define the reward rates by $r_0(x,a) = -I\{x \in S\}$ and $r_1(x,a) = -I\{x = 0\}$, which are continuous in $(x,a) \in S \times A$, so that with the initial distribution $\gamma(\{0\}) = 1$, $-W(0,\pi,r_0)$ represents the expected total time up to the explosion moment, and $-W(0,\pi,r_1)$ is the expected total time spent at the state zero before the explosion moment.

Let $a^*$ be some fixed action in $A$ such that under the policy $\varphi^*(x) = a^*$ for all $x \in S$, it holds that

$$P^{\varphi^*}_0(t_\infty < \infty) > 0 \qquad (60)$$

(we will give specific system parameters that admit such a policy shortly below). Suppose

$$\sum_{x=0}^{\infty} g_x(a^*) < \infty, \qquad (61)$$

where $a^*$ is fixed as above, and for each $a \in A$ we define

$$g_0(a) := \frac{1}{b_0(a)} \quad\text{and}\quad g_x(a) := \frac{\prod_{k=1}^{x} a_k(a)}{\prod_{k=0}^{x} b_k(a)}$$

for each $x \ge 1$. Let $d_1 < -\sum_{x=0}^{\infty} g_x(a^*)$. According to a well known result by Dobrushin, see Corollary 1 in [35, p. 191], (60) implies $W(0,\varphi^*,r_0) > -\infty$. It is also shown in Example 1 in [35, p. 191], see also the discussion before equation (35) in [35, p. 192], that (61) implies $W(0,\varphi^*,r_1) = -\sum_{x=0}^{\infty} g_x(a^*) > d_1$, where the last inequality is by the definition of $d_1$. Consequently, the policy $\varphi^*$ is feasible and has a finite value. Thus, for this example, if (60) and (61) hold for some action $a^* \in A$, then by our Theorem 3.4 there exists a stationary optimal policy for the constrained CTMDP problem (P.
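Condition (61) of Example 6.5 is easy to explore numerically for given rates. The sketch below computes the terms $g_x(a)$ as $\prod_{k=1}^{x} a_k(a) / \prod_{k=0}^{x} b_k(a)$ for one fixed action. The rates used, $a_x = 2^x$ and $b_x = 2^{x+1}$, are hypothetical and are not the parameters of the paper; they are chosen only because the sum is then a geometric series. Only (61) is examined here, not the explosion condition (60).

```python
def g_terms(death_rate, birth_rate, n_terms):
    """Return [g_0, ..., g_{n_terms-1}] for a birth-and-death chain under one action.

    death_rate(x) plays the role of a_x(a) (x >= 1) and birth_rate(x) of b_x(a).
    """
    terms = [1.0 / birth_rate(0)]
    for x in range(1, n_terms):
        # g_x = g_{x-1} * a_x / b_x, which avoids forming huge products
        terms.append(terms[-1] * death_rate(x) / birth_rate(x))
    return terms


# Hypothetical rates (an assumption for illustration only).
terms = g_terms(lambda x: 2.0 ** x, lambda x: 2.0 ** (x + 1), n_terms=40)
partial_sum = sum(terms)
print(partial_sum)
```

With these rates $g_x = 2^{-(x+1)}$, so the partial sums approach 1; in the notation of the example any $d_1 < -1$ would then be an admissible constraint level for such an action.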

Below we specify the system parameters of the general controlled birth-and-death process described above, 
where (|60p and (pT|) arc satisfied for some a* G A. Let A = {1,2,..., M} with M > 2 being arbitrarily fixed, 
Ox (a) = 3;^° > and bx{a) — x'^''{x + 1)^" > for each x E S and a E A, which arc continuous in {x, a) E S x A. 
Then, according to Theorem 2.2 in p. 100 and the example 3 in p. 102 of 0, for the deterministic stationary 
policy (p*{x) = a* with a* = 1, it holds that Pq (ioo < 00) > 0. Furthermore, trivial calculations result in 
^^=0 9x{0'*) — ^^=0 fa l^l^li < oo- Thus, (|5ni) and (pT)) are verified for a* = 1. By the way, in this specific 
example, one can relax A from being finite to an infinite compact space such as [l,Af]; it is easy to see that 
Qxio.) > 1 for each x £ S and a £ A, verifying Condition 13. 11 and one can refer to Theorem 13.31 for the existence 
of a stationary optimal policy. 
Remark 6.2 The CTMDP in Example 6.5 is non-absorbing, with total undiscounted reward criteria and constraints. In addition, it is explosive, so that the Lyapunov-type condition ensuring the non-explosiveness of the CTMDP under all the policies, see Condition 3.2, is not satisfied. Therefore, the CTMDP presented in Example 6.5 is not covered by the current literature, whereas the optimality results obtained in the present paper are applicable. On the other hand, since the performance of the CTMDP is measured by the expected total time, it would not be appropriate to introduce positive discount factors at all. By the way, to the best of our knowledge, this is the first time that a CTMDP problem exhibiting all these complicated properties has been formulated.

7 Conclusion 

In conclusion, for possibly non-absorbing CTMDPs in Polish state and Borel action spaces with state-dependent discount factors which could be zero-valued, we provided general sufficient conditions for the existence of deterministic (resp., randomized) stationary optimal policies for the unconstrained (resp., constrained) optimization problems, where the optimality is out of the class of history-dependent policies. In particular, our results apply to CTMDPs with unbounded transition and reward rates. Our proofs develop the idea of reducing CTMDPs to DTMDPs by comparing their occupancy measures as demonstrated in [HUH], which only focus on standard discounted problems. However, we emphasize that this is far from a straightforward extension, since the occupancy measures for the CTMDPs considered in this paper and for the corresponding DTMDPs can be different. As a matter of fact, we show that the transformation method could fail to work for undiscounted CTMDPs. For this reason, another contribution of this paper is that we provided new and weak conditions to validate the transformation method. Finally, a new example was formulated, where the CTMDP is explosive, non-absorbing, constrained and with total undiscounted reward criteria, to illustrate the applications of the obtained results.